COVID-19 CAPSTONE PROJECT¶

Background¶

The SARS-CoV-2 virus causes coronavirus disease 2019, popularly referred to as Covid-19, which was first reported in 2019 and declared a pandemic in March 2020. For two years, the world has battled the disease with traditional public health measures such as quarantine and social distancing, as well as innovative technologies such as rapid testing and breakthrough vaccines.

Problem Statement¶

To predict the likelihood of a positive or negative Covid-19 test result, and to identify the factors that influence these results, using a collection of laboratory tests from suspected cases.

Need for the study¶

In areas with overwhelmed healthcare systems, it is impractical to test every patient who presents with flu-like symptoms. It is therefore important to establish targeted testing criteria that are effective in identifying positive Covid-19 cases.

Business/ Social Opportunities¶

a) This will enable fair allocation of resources in the management of Covid-19 cases.

b) Developing effective criteria will enable rapid detection of cases and reduce the disease burden by quickly initiating management of positive cases.

c) This algorithm could reduce hospital wait times and shorten queues in the waiting rooms and testing centers.

In [1]:
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Libraries for data manipulation
import numpy as np
import pandas as pd

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To perform statistical analysis
import scipy.stats as stats

# To scale the data and one-hot encode categorical variables
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To split the data and tune/validate models
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    RandomizedSearchCV,
    StratifiedKFold,
    cross_val_score,
)

# Models for prediction
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
)

# To get different metric scores
from sklearn import metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    plot_confusion_matrix,
    make_scorer,
    precision_recall_curve,
    roc_curve,
    roc_auc_score,
)

# Statsmodels for logistic regression and multicollinearity checks
import statsmodels.api as sm
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# To build pipelines and apply transformers to column subsets
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Display options: show all columns, up to 200 rows,
# and suppress scientific notation in dataframes
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To undersample and oversample the data
!pip install imblearn
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
Requirement already satisfied: imblearn in /Users/kofori/opt/anaconda3/lib/python3.9/site-packages (0.0)
Requirement already satisfied: imbalanced-learn in /Users/kofori/opt/anaconda3/lib/python3.9/site-packages (from imblearn) (0.10.1)
Requirement already satisfied: numpy>=1.17.3 in /Users/kofori/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.21.5)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/kofori/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (2.2.0)
Requirement already satisfied: joblib>=1.1.1 in /Users/kofori/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.2.0)
Requirement already satisfied: scikit-learn>=1.0.2 in /Users/kofori/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.0.2)
Requirement already satisfied: scipy>=1.3.2 in /Users/kofori/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.9.1)
In [2]:
# Load the dataset from Excel
data = pd.read_excel("covid19_dataset.xlsx")
In [3]:
# Work on a copy so the raw data stays intact
df = data.copy()
df.head()
Out[3]:
[Output: the first 5 rows × 111 columns. Columns comprise Patient ID, Patient age quantile, SARS-Cov-2 exam result, three admission indicators (regular ward, semi-intensive unit, intensive care unit), hematology and blood-chemistry measurements, a respiratory viral panel, urinalysis results, and venous/arterial blood-gas analyses; the laboratory columns are NaN for most of these rows.]
In [4]:
# viewing a random sample of the dataset
df.sample(n=10, random_state=1)
Out[4]:
[Output: a random sample of 10 of the 5,644 rows across all 111 columns; as with df.head(), the laboratory columns are predominantly NaN.]

Methodology¶

The data appears to have been collected from patient medical records; it is therefore likely a secondary data source.

Data Overview¶

Observations and Sanity Checks¶

In [5]:
# Code to ascertain the number of rows and columns
df.shape
Out[5]:
(5644, 111)
In [6]:
# Use info() to print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5644 entries, 0 to 5643
Columns: 111 entries, Patient ID to ctO2 (arterial blood gas analysis)
dtypes: float64(70), int64(4), object(37)
memory usage: 4.8+ MB

Observations:¶

  • There are 5644 rows and 111 columns.
  • 70 columns hold float values.
  • 4 columns hold integer values.
  • 37 columns hold object (string) data.
  • The DataFrame occupies about 4.8 MB of memory in this format.
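The dtype split above can be recovered programmatically with `select_dtypes`, which becomes useful later when numeric columns are scaled and object columns are encoded separately. A minimal sketch on a toy frame (the columns here are illustrative stand-ins for the dataset's):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df: one float, one int, one object column
toy = pd.DataFrame({
    "Hematocrit": [0.2, np.nan],                         # float64
    "Patient age quantile": [13, 17],                    # int64
    "SARS-Cov-2 exam result": ["negative", "positive"],  # object
})

# Select column names by dtype family
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(numeric_cols)  # ['Hematocrit', 'Patient age quantile']
print(object_cols)   # ['SARS-Cov-2 exam result']
```

The same two lists, computed on `df`, can feed directly into a `ColumnTransformer` later in the pipeline.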
In [7]:
# checking the number of missing values in each column
df.isna().sum()
Out[7]:
Patient ID                                                  0
Patient age quantile                                        0
SARS-Cov-2 exam result                                      0
Patient addmited to regular ward (1=yes, 0=no)              0
Patient addmited to semi-intensive unit (1=yes, 0=no)       0
Patient addmited to intensive care unit (1=yes, 0=no)       0
Hematocrit                                               5041
Hemoglobin                                               5041
Platelets                                                5042
Mean platelet volume                                     5045
Red blood Cells                                          5042
Lymphocytes                                              5042
Mean corpuscular hemoglobin concentration (MCHC)         5042
Leukocytes                                               5042
Basophils                                                5042
Mean corpuscular hemoglobin (MCH)                        5042
Eosinophils                                              5042
Mean corpuscular volume (MCV)                            5042
Monocytes                                                5043
Red blood cell distribution width (RDW)                  5042
Serum Glucose                                            5436
Respiratory Syncytial Virus                              4290
Influenza A                                              4290
Influenza B                                              4290
Parainfluenza 1                                          4292
CoronavirusNL63                                          4292
Rhinovirus/Enterovirus                                   4292
Mycoplasma pneumoniae                                    5644
Coronavirus HKU1                                         4292
Parainfluenza 3                                          4292
Chlamydophila pneumoniae                                 4292
Adenovirus                                               4292
Parainfluenza 4                                          4292
Coronavirus229E                                          4292
CoronavirusOC43                                          4292
Inf A H1N1 2009                                          4292
Bordetella pertussis                                     4292
Metapneumovirus                                          4292
Parainfluenza 2                                          4292
Neutrophils                                              5131
Urea                                                     5247
Proteina C reativa mg/dL                                 5138
Creatinine                                               5220
Potassium                                                5273
Sodium                                                   5274
Influenza B, rapid test                                  4824
Influenza A, rapid test                                  4824
Alanine transaminase                                     5419
Aspartate transaminase                                   5418
Gamma-glutamyltransferase                                5491
Total Bilirubin                                          5462
Direct Bilirubin                                         5462
Indirect Bilirubin                                       5462
Alkaline phosphatase                                     5500
Ionized calcium                                          5594
Strepto A                                                5312
Magnesium                                                5604
pCO2 (venous blood gas analysis)                         5508
Hb saturation (venous blood gas analysis)                5508
Base excess (venous blood gas analysis)                  5508
pO2 (venous blood gas analysis)                          5508
Fio2 (venous blood gas analysis)                         5643
Total CO2 (venous blood gas analysis)                    5508
pH (venous blood gas analysis)                           5508
HCO3 (venous blood gas analysis)                         5508
Rods #                                                   5547
Segmented                                                5547
Promyelocytes                                            5547
Metamyelocytes                                           5547
Myelocytes                                               5547
Myeloblasts                                              5547
Urine - Esterase                                         5584
Urine - Aspect                                           5574
Urine - pH                                               5574
Urine - Hemoglobin                                       5574
Urine - Bile pigments                                    5574
Urine - Ketone Bodies                                    5587
Urine - Nitrite                                          5643
Urine - Density                                          5574
Urine - Urobilinogen                                     5575
Urine - Protein                                          5584
Urine - Sugar                                            5644
Urine - Leukocytes                                       5574
Urine - Crystals                                         5574
Urine - Red blood cells                                  5574
Urine - Hyaline cylinders                                5577
Urine - Granular cylinders                               5575
Urine - Yeasts                                           5574
Urine - Color                                            5574
Partial thromboplastin time (PTT)                        5644
Relationship (Patient/Normal)                            5553
International normalized ratio (INR)                     5511
Lactic Dehydrogenase                                     5543
Prothrombin time (PT), Activity                          5644
Vitamin B12                                              5641
Creatine phosphokinase (CPK)                             5540
Ferritin                                                 5621
Arterial Lactic Acid                                     5617
Lipase dosage                                            5636
D-Dimer                                                  5644
Albumin                                                  5631
Hb saturation (arterial blood gases)                     5617
pCO2 (arterial blood gas analysis)                       5617
Base excess (arterial blood gas analysis)                5617
pH (arterial blood gas analysis)                         5617
Total CO2 (arterial blood gas analysis)                  5617
HCO3 (arterial blood gas analysis)                       5617
pO2 (arterial blood gas analysis)                        5617
Arteiral Fio2                                            5624
Phosphor                                                 5624
ctO2 (arterial blood gas analysis)                       5617
dtype: int64

Comment:¶

The dataset contains extensive missing data: most laboratory columns are missing for the vast majority of the 5,644 patients, and a few columns (e.g. Mycoplasma pneumoniae, Urine - Sugar, Partial thromboplastin time (PTT), D-Dimer) have no values at all.
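A quick way to quantify this is the percentage of missing values per column. A minimal sketch, using a small toy frame as a stand-in for `df`:

```python
import numpy as np
import pandas as pd

# Toy frame: 'Hematocrit' missing for 3 of 4 rows, ID column complete
toy = pd.DataFrame({
    "Patient ID": ["a", "b", "c", "d"],
    "Hematocrit": [0.237, np.nan, np.nan, np.nan],
})

# isna() gives booleans; the mean of a boolean column is the
# fraction missing, which mul(100) turns into a percentage
missing_pct = toy.isna().mean().mul(100)
print(missing_pct["Hematocrit"])  # 75.0
print(missing_pct["Patient ID"])  # 0.0
```

Sorting the resulting Series in descending order makes it easy to spot columns that are too sparse to impute sensibly.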

In [8]:
# checking for duplicated rows
df.duplicated().sum()
Out[8]:
0

Comment:¶

There are no duplicated rows in this dataset.
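Note that `duplicated()` compares entire rows; since laboratory values differ between visits, a stricter check is to look for repeats in the identifier column alone. A sketch on toy data (the repeat patient here is invented for illustration):

```python
import pandas as pd

# Toy frame: patient "b" appears twice with different lab values
toy = pd.DataFrame({
    "Patient ID": ["a", "b", "b"],
    "Hematocrit": [0.1, 0.2, 0.3],
})

# No fully-identical rows...
full_row_dupes = toy.duplicated().sum()
# ...but restricting the check to the identifier reveals a repeat
id_dupes = toy.duplicated(subset=["Patient ID"]).sum()
print(full_row_dupes, id_dupes)  # 0 1
```

For this dataset the `nunique()` check below confirms that all 5,644 Patient IDs are distinct, so the two checks agree.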

In [9]:
# Code to check the number of unique values in each column
df.nunique()
Out[9]:
Patient ID                                               5644
Patient age quantile                                       20
SARS-Cov-2 exam result                                      2
Patient addmited to regular ward (1=yes, 0=no)              2
Patient addmited to semi-intensive unit (1=yes, 0=no)       2
Patient addmited to intensive care unit (1=yes, 0=no)       2
Hematocrit                                                176
Hemoglobin                                                 84
Platelets                                                 249
Mean platelet volume                                       48
Red blood Cells                                           211
Lymphocytes                                               318
Mean corpuscular hemoglobin concentration (MCHC)           57
Leukocytes                                                475
Basophils                                                  17
Mean corpuscular hemoglobin (MCH)                          91
Eosinophils                                                86
Mean corpuscular volume (MCV)                             190
Monocytes                                                 146
Red blood cell distribution width (RDW)                    61
Serum Glucose                                              71
Respiratory Syncytial Virus                                 2
Influenza A                                                 2
Influenza B                                                 2
Parainfluenza 1                                             2
CoronavirusNL63                                             2
Rhinovirus/Enterovirus                                      2
Mycoplasma pneumoniae                                       0
Coronavirus HKU1                                            2
Parainfluenza 3                                             2
Chlamydophila pneumoniae                                    2
Adenovirus                                                  2
Parainfluenza 4                                             2
Coronavirus229E                                             2
CoronavirusOC43                                             2
Inf A H1N1 2009                                             2
Bordetella pertussis                                        2
Metapneumovirus                                             2
Parainfluenza 2                                             1
Neutrophils                                               308
Urea                                                       54
Proteina C reativa mg/dL                                  265
Creatinine                                                119
Potassium                                                  22
Sodium                                                     19
Influenza B, rapid test                                     2
Influenza A, rapid test                                     2
Alanine transaminase                                       62
Aspartate transaminase                                     51
Gamma-glutamyltransferase                                  70
Total Bilirubin                                            19
Direct Bilirubin                                           10
Indirect Bilirubin                                         10
Alkaline phosphatase                                       82
Ionized calcium                                            20
Strepto A                                                   3
Magnesium                                                   9
pCO2 (venous blood gas analysis)                           97
Hb saturation (venous blood gas analysis)                 120
Base excess (venous blood gas analysis)                    72
pO2 (venous blood gas analysis)                           121
Fio2 (venous blood gas analysis)                            1
Total CO2 (venous blood gas analysis)                      78
pH (venous blood gas analysis)                             89
HCO3 (venous blood gas analysis)                           78
Rods #                                                     15
Segmented                                                  55
Promyelocytes                                               2
Metamyelocytes                                              4
Myelocytes                                                  4
Myeloblasts                                                 1
Urine - Esterase                                            2
Urine - Aspect                                              4
Urine - pH                                                 15
Urine - Hemoglobin                                          3
Urine - Bile pigments                                       2
Urine - Ketone Bodies                                       2
Urine - Nitrite                                             1
Urine - Density                                            24
Urine - Urobilinogen                                        2
Urine - Protein                                             2
Urine - Sugar                                               0
Urine - Leukocytes                                         31
Urine - Crystals                                            5
Urine - Red blood cells                                    32
Urine - Hyaline cylinders                                   1
Urine - Granular cylinders                                  1
Urine - Yeasts                                              1
Urine - Color                                               4
Partial thromboplastin time (PTT)                           0
Relationship (Patient/Normal)                              35
International normalized ratio (INR)                       42
Lactic Dehydrogenase                                       79
Prothrombin time (PT), Activity                             0
Vitamin B12                                                 3
Creatine phosphokinase (CPK)                               77
Ferritin                                                   23
Arterial Lactic Acid                                       13
Lipase dosage                                               7
D-Dimer                                                     0
Albumin                                                    10
Hb saturation (arterial blood gases)                       23
pCO2 (arterial blood gas analysis)                         25
Base excess (arterial blood gas analysis)                  20
pH (arterial blood gas analysis)                           24
Total CO2 (arterial blood gas analysis)                    24
HCO3 (arterial blood gas analysis)                         23
pO2 (arterial blood gas analysis)                          27
Arteiral Fio2                                               9
Phosphor                                                   16
ctO2 (arterial blood gas analysis)                         19
dtype: int64

Comment:¶

There are 5644 unique patients.
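The patient count above comes from counting distinct IDs. A minimal sketch of the same check on a toy frame (the column names mirror this dataset; the values are illustrative):

```python
import pandas as pd

# Toy frame with the same structure: one row per test, identified by Patient ID
toy = pd.DataFrame({
    "Patient ID": ["a1", "b2", "c3"],
    "SARS-Cov-2 exam result": ["negative", "positive", "negative"],
})

n_unique = toy["Patient ID"].nunique()          # number of distinct patients
n_dupes = toy["Patient ID"].duplicated().sum()  # repeated IDs, if any
```

Here `nunique()` equals the row count, so each row is a unique patient, which is the same conclusion drawn for the full dataset.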

In [10]:
# statistical summary of the data
df.describe(include="all").T
Out[10]:
count unique top freq mean std min 25% 50% 75% max
Patient ID 5644 5644 44477f75e8169d2 1 NaN NaN NaN NaN NaN NaN NaN
Patient age quantile 5644.000 NaN NaN NaN 9.318 5.778 0.000 4.000 9.000 14.000 19.000
SARS-Cov-2 exam result 5644 2 negative 5086 NaN NaN NaN NaN NaN NaN NaN
Patient addmited to regular ward (1=yes, 0=no) 5644.000 NaN NaN NaN 0.014 0.117 0.000 0.000 0.000 0.000 1.000
Patient addmited to semi-intensive unit (1=yes, 0=no) 5644.000 NaN NaN NaN 0.009 0.094 0.000 0.000 0.000 0.000 1.000
Patient addmited to intensive care unit (1=yes, 0=no) 5644.000 NaN NaN NaN 0.007 0.085 0.000 0.000 0.000 0.000 1.000
Hematocrit 603.000 NaN NaN NaN -0.000 1.001 -4.501 -0.519 0.053 0.717 2.663
Hemoglobin 603.000 NaN NaN NaN -0.000 1.001 -4.346 -0.586 0.040 0.730 2.672
Platelets 602.000 NaN NaN NaN -0.000 1.001 -2.552 -0.605 -0.122 0.531 9.532
Mean platelet volume 599.000 NaN NaN NaN 0.000 1.001 -2.458 -0.662 -0.102 0.684 3.713
Red blood Cells 602.000 NaN NaN NaN 0.000 1.001 -3.971 -0.568 0.014 0.666 3.646
Lymphocytes 602.000 NaN NaN NaN -0.000 1.001 -1.865 -0.731 -0.014 0.598 3.764
Mean corpuscular hemoglobin concentration (MCHC) 602.000 NaN NaN NaN 0.000 1.001 -5.432 -0.552 -0.055 0.642 3.331
Leukocytes 602.000 NaN NaN NaN 0.000 1.001 -2.020 -0.637 -0.213 0.454 4.522
Basophils 602.000 NaN NaN NaN -0.000 1.001 -1.140 -0.529 -0.224 0.387 11.078
Mean corpuscular hemoglobin (MCH) 602.000 NaN NaN NaN -0.000 1.001 -5.938 -0.501 0.126 0.596 4.099
Eosinophils 602.000 NaN NaN NaN 0.000 1.001 -0.836 -0.667 -0.330 0.344 8.351
Mean corpuscular volume (MCV) 602.000 NaN NaN NaN -0.000 1.001 -5.102 -0.515 0.066 0.627 3.411
Monocytes 601.000 NaN NaN NaN -0.000 1.001 -2.164 -0.614 -0.115 0.489 4.533
Red blood cell distribution width (RDW) 602.000 NaN NaN NaN 0.000 1.001 -1.598 -0.625 -0.183 0.348 6.982
Serum Glucose 208.000 NaN NaN NaN 0.000 1.002 -1.110 -0.504 -0.292 0.139 7.006
Respiratory Syncytial Virus 1354 2 not_detected 1302 NaN NaN NaN NaN NaN NaN NaN
Influenza A 1354 2 not_detected 1336 NaN NaN NaN NaN NaN NaN NaN
Influenza B 1354 2 not_detected 1277 NaN NaN NaN NaN NaN NaN NaN
Parainfluenza 1 1352 2 not_detected 1349 NaN NaN NaN NaN NaN NaN NaN
CoronavirusNL63 1352 2 not_detected 1307 NaN NaN NaN NaN NaN NaN NaN
Rhinovirus/Enterovirus 1352 2 not_detected 973 NaN NaN NaN NaN NaN NaN NaN
Mycoplasma pneumoniae 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Coronavirus HKU1 1352 2 not_detected 1332 NaN NaN NaN NaN NaN NaN NaN
Parainfluenza 3 1352 2 not_detected 1342 NaN NaN NaN NaN NaN NaN NaN
Chlamydophila pneumoniae 1352 2 not_detected 1343 NaN NaN NaN NaN NaN NaN NaN
Adenovirus 1352 2 not_detected 1339 NaN NaN NaN NaN NaN NaN NaN
Parainfluenza 4 1352 2 not_detected 1333 NaN NaN NaN NaN NaN NaN NaN
Coronavirus229E 1352 2 not_detected 1343 NaN NaN NaN NaN NaN NaN NaN
CoronavirusOC43 1352 2 not_detected 1344 NaN NaN NaN NaN NaN NaN NaN
Inf A H1N1 2009 1352 2 not_detected 1254 NaN NaN NaN NaN NaN NaN NaN
Bordetella pertussis 1352 2 not_detected 1350 NaN NaN NaN NaN NaN NaN NaN
Metapneumovirus 1352 2 not_detected 1338 NaN NaN NaN NaN NaN NaN NaN
Parainfluenza 2 1352 1 not_detected 1352 NaN NaN NaN NaN NaN NaN NaN
Neutrophils 513.000 NaN NaN NaN 0.000 1.001 -3.340 -0.652 -0.054 0.684 2.536
Urea 397.000 NaN NaN NaN -0.000 1.001 -1.630 -0.588 -0.142 0.454 11.247
Proteina C reativa mg/dL 506.000 NaN NaN NaN 0.000 1.001 -0.535 -0.514 -0.394 0.032 8.027
Creatinine 424.000 NaN NaN NaN -0.000 1.001 -2.390 -0.632 -0.081 0.513 5.054
Potassium 371.000 NaN NaN NaN 0.000 1.001 -2.283 -0.800 -0.059 0.683 3.402
Sodium 370.000 NaN NaN NaN 0.000 1.001 -5.247 -0.575 0.144 0.503 4.097
Influenza B, rapid test 820 2 negative 771 NaN NaN NaN NaN NaN NaN NaN
Influenza A, rapid test 820 2 negative 768 NaN NaN NaN NaN NaN NaN NaN
Alanine transaminase 225.000 NaN NaN NaN 0.000 1.002 -0.642 -0.449 -0.284 0.102 7.931
Aspartate transaminase 226.000 NaN NaN NaN -0.000 1.002 -0.704 -0.433 -0.278 0.031 7.231
Gamma-glutamyltransferase 153.000 NaN NaN NaN -0.000 1.003 -0.477 -0.376 -0.286 -0.061 8.508
Total Bilirubin 182.000 NaN NaN NaN -0.000 1.003 -1.093 -0.787 -0.175 0.131 5.029
Direct Bilirubin 182.000 NaN NaN NaN 0.000 1.003 -1.170 -0.586 -0.003 -0.003 6.996
Indirect Bilirubin 182.000 NaN NaN NaN 0.000 1.003 -0.771 -0.771 -0.279 0.214 6.615
Alkaline phosphatase 144.000 NaN NaN NaN -0.000 1.003 -0.959 -0.609 -0.358 0.054 3.883
Ionized calcium 50.000 NaN NaN NaN 0.000 1.010 -2.100 -0.729 0.060 0.558 3.549
Strepto A 332 3 negative 297 NaN NaN NaN NaN NaN NaN NaN
Magnesium 40.000 NaN NaN NaN -0.000 1.013 -2.191 -0.558 -0.014 0.531 2.164
pCO2 (venous blood gas analysis) 136.000 NaN NaN NaN -0.000 1.004 -2.705 -0.547 0.014 0.619 5.680
Hb saturation (venous blood gas analysis) 136.000 NaN NaN NaN 0.000 1.004 -2.296 -0.803 0.090 0.817 1.708
Base excess (venous blood gas analysis) 136.000 NaN NaN NaN -0.000 1.004 -3.669 -0.402 0.080 0.554 3.357
pO2 (venous blood gas analysis) 136.000 NaN NaN NaN -0.000 1.004 -1.634 -0.694 -0.213 0.483 3.775
Fio2 (venous blood gas analysis) 1.000 NaN NaN NaN 0.000 NaN 0.000 0.000 0.000 0.000 0.000
Total CO2 (venous blood gas analysis) 136.000 NaN NaN NaN -0.000 1.004 -2.598 -0.495 0.104 0.542 3.021
pH (venous blood gas analysis) 136.000 NaN NaN NaN 0.000 1.004 -4.773 -0.526 -0.091 0.490 2.790
HCO3 (venous blood gas analysis) 136.000 NaN NaN NaN -0.000 1.004 -2.645 -0.529 0.101 0.529 2.782
Rods # 97.000 NaN NaN NaN 0.000 1.005 -0.624 -0.624 -0.624 0.326 3.496
Segmented 97.000 NaN NaN NaN -0.000 1.005 -2.264 -0.673 0.176 0.919 1.502
Promyelocytes 97.000 NaN NaN NaN 0.000 1.005 -0.102 -0.102 -0.102 -0.102 9.798
Metamyelocytes 97.000 NaN NaN NaN 0.000 1.005 -0.316 -0.316 -0.316 -0.316 6.136
Myelocytes 97.000 NaN NaN NaN 0.000 1.005 -0.233 -0.233 -0.233 -0.233 6.551
Myeloblasts 97.000 NaN NaN NaN 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Urine - Esterase 60 2 absent 59 NaN NaN NaN NaN NaN NaN NaN
Urine - Aspect 70 4 clear 61 NaN NaN NaN NaN NaN NaN NaN
Urine - pH 70 15 5.0 14 NaN NaN NaN NaN NaN NaN NaN
Urine - Hemoglobin 70 3 absent 53 NaN NaN NaN NaN NaN NaN NaN
Urine - Bile pigments 70 2 absent 69 NaN NaN NaN NaN NaN NaN NaN
Urine - Ketone Bodies 57 2 absent 56 NaN NaN NaN NaN NaN NaN NaN
Urine - Nitrite 1 1 not_done 1 NaN NaN NaN NaN NaN NaN NaN
Urine - Density 70.000 NaN NaN NaN -0.000 1.007 -1.757 -0.764 -0.055 0.655 2.499
Urine - Urobilinogen 69 2 normal 68 NaN NaN NaN NaN NaN NaN NaN
Urine - Protein 60 2 absent 59 NaN NaN NaN NaN NaN NaN NaN
Urine - Sugar 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Urine - Leukocytes 70 31 <1000 9 NaN NaN NaN NaN NaN NaN NaN
Urine - Crystals 70 5 Ausentes 65 NaN NaN NaN NaN NaN NaN NaN
Urine - Red blood cells 70.000 NaN NaN NaN 0.000 1.007 -0.202 -0.202 -0.194 -0.166 7.822
Urine - Hyaline cylinders 67 1 absent 67 NaN NaN NaN NaN NaN NaN NaN
Urine - Granular cylinders 69 1 absent 69 NaN NaN NaN NaN NaN NaN NaN
Urine - Yeasts 70 1 absent 70 NaN NaN NaN NaN NaN NaN NaN
Urine - Color 70 4 yellow 55 NaN NaN NaN NaN NaN NaN NaN
Partial thromboplastin time (PTT) 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Relationship (Patient/Normal) 91.000 NaN NaN NaN -0.000 1.006 -2.351 -0.497 -0.089 0.453 4.706
International normalized ratio (INR) 133.000 NaN NaN NaN -0.000 1.004 -1.797 -0.665 -0.156 0.297 7.370
Lactic Dehydrogenase 101.000 NaN NaN NaN 0.000 1.005 -1.359 -0.700 -0.331 0.473 2.950
Prothrombin time (PT), Activity 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Vitamin B12 3.000 NaN NaN NaN -0.000 1.225 -1.401 -0.435 0.531 0.700 0.870
Creatine phosphokinase (CPK) 104.000 NaN NaN NaN -0.000 1.005 -0.516 -0.377 -0.225 0.035 7.216
Ferritin 23.000 NaN NaN NaN 0.000 1.022 -0.628 -0.560 -0.358 0.120 3.846
Arterial Lactic Acid 27.000 NaN NaN NaN -0.000 1.019 -1.091 -0.695 -0.298 0.230 3.004
Lipase dosage 8.000 NaN NaN NaN -0.000 1.069 -1.192 -0.547 -0.351 0.182 1.725
D-Dimer 0.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Albumin 13.000 NaN NaN NaN -0.000 1.041 -2.290 -0.539 -0.038 0.462 1.963
Hb saturation (arterial blood gases) 27.000 NaN NaN NaN -0.000 1.019 -2.000 -1.123 0.268 0.738 1.337
pCO2 (arterial blood gas analysis) 27.000 NaN NaN NaN 0.000 1.019 -1.245 -0.535 -0.212 0.023 3.237
Base excess (arterial blood gas analysis) 27.000 NaN NaN NaN -0.000 1.019 -3.083 -0.331 -0.012 0.666 1.703
pH (arterial blood gas analysis) 27.000 NaN NaN NaN 0.000 1.019 -3.569 -0.092 0.294 0.512 1.043
Total CO2 (arterial blood gas analysis) 27.000 NaN NaN NaN -0.000 1.019 -2.926 -0.512 0.077 0.439 1.940
HCO3 (arterial blood gas analysis) 27.000 NaN NaN NaN 0.000 1.019 -2.986 -0.540 0.056 0.509 2.029
pO2 (arterial blood gas analysis) 27.000 NaN NaN NaN -0.000 1.019 -1.176 -0.817 -0.160 0.450 2.205
Arteiral Fio2 20.000 NaN NaN NaN 0.000 1.026 -1.533 -0.121 -0.012 -0.012 2.842
Phosphor 20.000 NaN NaN NaN 0.000 1.026 -1.481 -0.553 -0.138 0.276 2.862
ctO2 (arterial blood gas analysis) 27.000 NaN NaN NaN 0.000 1.019 -2.900 -0.485 0.183 0.594 1.827

Comment:¶

  • The mean bicarbonate (HCO3) level in the arterial blood gas analysis is 0.
  • The mean partial pressure of oxygen (pO2) in the arterial blood gas analysis is 0.
  • The mean serum phosphorus level is 0.
  • In fact, most continuous laboratory variables have a mean of 0 and a standard deviation close to 1, which suggests they have been standardized (z-scored).
  • Some data cleaning is needed, as the statistical summaries for variables such as Patient ID and the yes/no admission columns do not provide meaningful insight.
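The standardization pattern (mean ≈ 0, std ≈ 1) in the summary table can be reproduced numerically. A minimal sketch on synthetic data, assuming a simple z-score transform was applied to the lab values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic raw lab values, e.g. hemoglobin in g/dL (illustrative only)
raw = pd.Series(rng.normal(loc=13.5, scale=1.6, size=500))

# z-score standardization: subtract the mean, divide by the standard deviation
z = (raw - raw.mean()) / raw.std()
```

After this transform, `z.mean()` is 0 and `z.std()` is 1 up to floating-point error, matching the describe() output above.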

Initial Exploratory Data Analysis¶

In [11]:
# Columns to be analyzed
df.columns.tolist()
Out[11]:
['Patient ID',
 'Patient age quantile',
 'SARS-Cov-2 exam result',
 'Patient addmited to regular ward (1=yes, 0=no)',
 'Patient addmited to semi-intensive unit (1=yes, 0=no)',
 'Patient addmited to intensive care unit (1=yes, 0=no)',
 'Hematocrit',
 'Hemoglobin',
 'Platelets',
 'Mean platelet volume ',
 'Red blood Cells',
 'Lymphocytes',
 'Mean corpuscular hemoglobin concentration\xa0(MCHC)',
 'Leukocytes',
 'Basophils',
 'Mean corpuscular hemoglobin (MCH)',
 'Eosinophils',
 'Mean corpuscular volume (MCV)',
 'Monocytes',
 'Red blood cell distribution width (RDW)',
 'Serum Glucose',
 'Respiratory Syncytial Virus',
 'Influenza A',
 'Influenza B',
 'Parainfluenza 1',
 'CoronavirusNL63',
 'Rhinovirus/Enterovirus',
 'Mycoplasma pneumoniae',
 'Coronavirus HKU1',
 'Parainfluenza 3',
 'Chlamydophila pneumoniae',
 'Adenovirus',
 'Parainfluenza 4',
 'Coronavirus229E',
 'CoronavirusOC43',
 'Inf A H1N1 2009',
 'Bordetella pertussis',
 'Metapneumovirus',
 'Parainfluenza 2',
 'Neutrophils',
 'Urea',
 'Proteina C reativa mg/dL',
 'Creatinine',
 'Potassium',
 'Sodium',
 'Influenza B, rapid test',
 'Influenza A, rapid test',
 'Alanine transaminase',
 'Aspartate transaminase',
 'Gamma-glutamyltransferase\xa0',
 'Total Bilirubin',
 'Direct Bilirubin',
 'Indirect Bilirubin',
 'Alkaline phosphatase',
 'Ionized calcium\xa0',
 'Strepto A',
 'Magnesium',
 'pCO2 (venous blood gas analysis)',
 'Hb saturation (venous blood gas analysis)',
 'Base excess (venous blood gas analysis)',
 'pO2 (venous blood gas analysis)',
 'Fio2 (venous blood gas analysis)',
 'Total CO2 (venous blood gas analysis)',
 'pH (venous blood gas analysis)',
 'HCO3 (venous blood gas analysis)',
 'Rods #',
 'Segmented',
 'Promyelocytes',
 'Metamyelocytes',
 'Myelocytes',
 'Myeloblasts',
 'Urine - Esterase',
 'Urine - Aspect',
 'Urine - pH',
 'Urine - Hemoglobin',
 'Urine - Bile pigments',
 'Urine - Ketone Bodies',
 'Urine - Nitrite',
 'Urine - Density',
 'Urine - Urobilinogen',
 'Urine - Protein',
 'Urine - Sugar',
 'Urine - Leukocytes',
 'Urine - Crystals',
 'Urine - Red blood cells',
 'Urine - Hyaline cylinders',
 'Urine - Granular cylinders',
 'Urine - Yeasts',
 'Urine - Color',
 'Partial thromboplastin time\xa0(PTT)\xa0',
 'Relationship (Patient/Normal)',
 'International normalized ratio (INR)',
 'Lactic Dehydrogenase',
 'Prothrombin time (PT), Activity',
 'Vitamin B12',
 'Creatine phosphokinase\xa0(CPK)\xa0',
 'Ferritin',
 'Arterial Lactic Acid',
 'Lipase dosage',
 'D-Dimer',
 'Albumin',
 'Hb saturation (arterial blood gases)',
 'pCO2 (arterial blood gas analysis)',
 'Base excess (arterial blood gas analysis)',
 'pH (arterial blood gas analysis)',
 'Total CO2 (arterial blood gas analysis)',
 'HCO3 (arterial blood gas analysis)',
 'pO2 (arterial blood gas analysis)',
 'Arteiral Fio2',
 'Phosphor',
 'ctO2 (arterial blood gas analysis)']
In [12]:
# Create a subset of the df dataframe comprising continuous data variables
df1=df[['Patient age quantile','Hematocrit', 'Hemoglobin', 'Platelets','Red blood Cells',
 'Lymphocytes',
 'Mean corpuscular hemoglobin concentration\xa0(MCHC)',
 'Leukocytes',
 'Basophils',
 'Mean corpuscular hemoglobin (MCH)',
 'Eosinophils',
 'Mean corpuscular volume (MCV)',
 'Monocytes',
 'Red blood cell distribution width (RDW)',
 'Serum Glucose',
 'Neutrophils',
 'Urea',
 'Proteina C reativa mg/dL',
 'Creatinine',
 'Potassium',
 'Sodium',
 'Aspartate transaminase',
 'Gamma-glutamyltransferase\xa0',
 'Total Bilirubin',
 'Direct Bilirubin',
 'Indirect Bilirubin',
 'Alkaline phosphatase',
 'Ionized calcium\xa0',
 'Magnesium',
 'pCO2 (venous blood gas analysis)',
 'Hb saturation (venous blood gas analysis)',
 'Base excess (venous blood gas analysis)',
 'pO2 (venous blood gas analysis)',
 'Fio2 (venous blood gas analysis)',
 'Total CO2 (venous blood gas analysis)',
 'pH (venous blood gas analysis)',
 'HCO3 (venous blood gas analysis)',
 'Rods #',
 'Segmented',
 'Promyelocytes',
 'Metamyelocytes',
 'Myelocytes',
 'Myeloblasts',
 'Relationship (Patient/Normal)',
 'International normalized ratio (INR)',
 'Lactic Dehydrogenase',
 'Vitamin B12',
 'Creatine phosphokinase\xa0(CPK)\xa0',
 'Ferritin','Urine - Red blood cells',
 'Arterial Lactic Acid',
 'Lipase dosage',
 'Albumin', 'Hb saturation (arterial blood gases)', 'pCO2 (arterial blood gas analysis)', 'Base excess (arterial blood gas analysis)', 'pH (arterial blood gas analysis)', 'Total CO2 (arterial blood gas analysis)', 'HCO3 (arterial blood gas analysis)', 'pO2 (arterial blood gas analysis)','Arteiral Fio2','Phosphor', 'ctO2 (arterial blood gas analysis)']]
In [13]:
df2=df[['SARS-Cov-2 exam result',
 'Patient addmited to regular ward (1=yes, 0=no)',
 'Patient addmited to semi-intensive unit (1=yes, 0=no)',
 'Patient addmited to intensive care unit (1=yes, 0=no)',  'Respiratory Syncytial Virus',
 'Influenza A',
 'Influenza B',
 'Parainfluenza 1',
 'CoronavirusNL63',
 'Rhinovirus/Enterovirus',
 'Coronavirus HKU1',
 'Parainfluenza 3',
 'Chlamydophila pneumoniae',
 'Adenovirus',
 'Parainfluenza 4',
 'Coronavirus229E',
 'CoronavirusOC43',
 'Inf A H1N1 2009',
 'Bordetella pertussis',
 'Metapneumovirus',
 'Parainfluenza 2','Influenza B, rapid test',
 'Influenza A, rapid test', 'Strepto A', 'Urine - Esterase',
 'Urine - Aspect',
 'Urine - Hemoglobin',
 'Urine - Bile pigments',
 'Urine - Ketone Bodies',
 'Urine - Nitrite',
 'Urine - Urobilinogen',
 'Urine - Protein',
 'Urine - Crystals',
 'Urine - Hyaline cylinders',
 'Urine - Granular cylinders',
 'Urine - Yeasts',
 'Urine - Color',]]
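Hand-listing columns for df1 and df2 works but is brittle if the schema changes. Assuming (as in this dataset) that the categorical test results are stored as strings and the standardized labs as floats, the same split could be sketched by dtype:

```python
import pandas as pd

# Toy frame mixing a standardized lab value with a categorical test result
toy = pd.DataFrame({
    "Hemoglobin": [0.04, -0.59, 0.73],
    "Influenza A": ["not_detected", "detected", "not_detected"],
})

num_cols = toy.select_dtypes(include="number").columns.tolist()  # continuous
cat_cols = toy.select_dtypes(include="object").columns.tolist()  # categorical
```

The hand-picked lists above remain useful where a column's dtype does not match its meaning (e.g. the 0/1 admission flags are numeric but categorical in nature).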

Univariate Analysis¶

In [14]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [15]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate: center of the bar
        y = p.get_height()  # y-coordinate: top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [16]:
## Visualize each continuous variable (df1) with a combined histogram and boxplot
for feature in df1.columns:
    histogram_boxplot(
        df1, feature, figsize=(12, 7), kde=False, bins=None,
    ) 

Observations¶

  • Many of the variables in this dataset are skewed.
  • The majority of the continuous variables have a mean of 0 and a standard deviation of 1, consistent with standardized (z-scored) values.
In [17]:
## Visualize each categorical variable (df2) with a labeled barplot
for feature in df2.columns:
    labeled_barplot(
        df2, feature, perc= True
    ) 

Observations¶

  • The majority of cases in the dataset (90.1%) tested negative for SARS-CoV-2.
  • Few patients (0.9%) were admitted to the semi-intensive unit.
  • 0.7% were admitted into the intensive care unit.
  • A few of the cases had either isolated infections or co-infections with other viruses (Respiratory Syncytial Virus: 0.9%, Influenza A: 0.3%, Influenza B: 1.4%, Parainfluenza 1: 0.1%, Coronavirus NL63: 0.8%).
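The class shares quoted above come directly from normalized value counts. A sketch with toy labels (a synthetic 90/10 split, not the real data):

```python
import pandas as pd

# Toy exam-result column: 9 negatives and 1 positive
results = pd.Series(["negative"] * 9 + ["positive"])

# value_counts(normalize=True) gives the fraction per class; *100 gives percent
shares = results.value_counts(normalize=True) * 100
```

On the real `SARS-Cov-2 exam result` column, the same call yields the 90.1% negative share annotated on the barplot.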

Bivariate analysis¶

1. Age distribution of positive covid-19 cases vs negative cases¶

In [18]:
sns.boxplot(data=df, x='SARS-Cov-2 exam result', y='Patient age quantile')
plt.show()

Observations:¶

  • The mean age quantile for positive Covid-19 cases is higher than that for negative cases.
  • Otherwise, the patient age quantile distributions for positive and negative cases are comparable.

2. Age distribution among covid-19 positive cases in the ICU¶

In [19]:
sns.boxplot(data =df[df['SARS-Cov-2 exam result']=='positive'], x='Patient addmited to intensive care unit (1=yes, 0=no)', y='Patient age quantile')
plt.show()

Observation:¶

Among Covid-19 cases, patients in higher age quantiles are more likely to be admitted to the ICU.

3. Age distribution of patients admitted to the ICU¶

In [20]:
sns.violinplot(data=df, x='Patient addmited to intensive care unit (1=yes, 0=no)', y='Patient age quantile', hue="SARS-Cov-2 exam result")
plt.show()

Observation:¶

  • Among ICU patients, those who are Covid-19 positive tend to fall in higher age quantiles.

4. pCO2 (arterial blood gas analysis) vs. SARS-Cov-2 exam result¶

In [21]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x="SARS-Cov-2 exam result", y='pCO2 (arterial blood gas analysis)' )
plt.show()

Observation:¶

  • Patients who were Covid-19 positive had a lower mean pCO2 (arterial blood gas analysis).

5. Urea vs. SARS-Cov-2 exam result¶

In [22]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x="SARS-Cov-2 exam result", y='Urea' )
plt.show()

Observation:¶

  • There is no apparent difference in urea levels between the positive and negative groups.

6. International normalized ratio (INR) vs SARS-Cov-2 exam result¶

In [23]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df, x="SARS-Cov-2 exam result", y='International normalized ratio (INR)' )
plt.show()

Observation¶

  • There is no apparent difference in INR between the two groups.
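Eyeballing a boxplot cannot establish "no significant difference"; a rank-based test would make such claims precise. A sketch with toy INR-like values (SciPy assumed available; the arrays are illustrative, not the real groups):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy INR-like values for the two exam-result groups (illustrative only)
negative = np.array([1.0, 1.1, 0.9, 1.2, 1.0])
positive = np.array([1.1, 1.0, 0.9, 1.2, 1.0])

# Mann-Whitney U: nonparametric test that the two samples come from the
# same distribution; a large p-value means no evidence of a difference
stat, p = mannwhitneyu(negative, positive, alternative="two-sided")
```

The same comparison could be run on the real Urea and INR columns (dropping NaNs first) to back the observations above with a p-value.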

7. Correlation Table¶

In [24]:
corr = df.corr()  # pairwise Pearson correlation between the numeric variables
corr
Out[24]:
Patient age quantile Patient addmited to regular ward (1=yes, 0=no) Patient addmited to semi-intensive unit (1=yes, 0=no) Patient addmited to intensive care unit (1=yes, 0=no) Hematocrit Hemoglobin Platelets Mean platelet volume Red blood Cells Lymphocytes Mean corpuscular hemoglobin concentration (MCHC) Leukocytes Basophils Mean corpuscular hemoglobin (MCH) Eosinophils Mean corpuscular volume (MCV) Monocytes Red blood cell distribution width (RDW) Serum Glucose Mycoplasma pneumoniae Neutrophils Urea Proteina C reativa mg/dL Creatinine Potassium Sodium Alanine transaminase Aspartate transaminase Gamma-glutamyltransferase Total Bilirubin Direct Bilirubin Indirect Bilirubin Alkaline phosphatase Ionized calcium Magnesium pCO2 (venous blood gas analysis) Hb saturation (venous blood gas analysis) Base excess (venous blood gas analysis) pO2 (venous blood gas analysis) Fio2 (venous blood gas analysis) Total CO2 (venous blood gas analysis) pH (venous blood gas analysis) HCO3 (venous blood gas analysis) Rods # Segmented Promyelocytes Metamyelocytes Myelocytes Myeloblasts Urine - Density Urine - Sugar Urine - Red blood cells Partial thromboplastin time (PTT) Relationship (Patient/Normal) International normalized ratio (INR) Lactic Dehydrogenase Prothrombin time (PT), Activity Vitamin B12 Creatine phosphokinase (CPK) Ferritin Arterial Lactic Acid Lipase dosage D-Dimer Albumin Hb saturation (arterial blood gases) pCO2 (arterial blood gas analysis) Base excess (arterial blood gas analysis) pH (arterial blood gas analysis) Total CO2 (arterial blood gas analysis) HCO3 (arterial blood gas analysis) pO2 (arterial blood gas analysis) Arteiral Fio2 Phosphor ctO2 (arterial blood gas analysis)
Patient age quantile 1.000 0.046 0.016 -0.036 0.097 0.060 -0.159 0.119 -0.038 -0.126 -0.125 -0.166 0.108 0.197 0.022 0.282 0.051 0.166 0.216 NaN 0.087 0.338 0.088 0.373 0.002 -0.005 0.129 0.039 0.224 0.146 0.268 0.008 -0.481 -0.310 -0.128 0.208 -0.059 0.555 -0.071 NaN 0.503 0.256 0.511 0.047 0.284 0.130 0.179 0.089 NaN -0.118 NaN 0.160 NaN -0.123 0.014 -0.150 NaN 0.981 -0.101 0.396 0.097 -0.357 NaN -0.137 -0.224 -0.469 0.570 0.571 0.086 0.166 -0.098 -0.335 -0.512 -0.061
Patient addmited to regular ward (1=yes, 0=no) 0.046 1.000 -0.011 -0.010 -0.087 -0.092 -0.183 -0.013 -0.053 -0.095 -0.035 -0.103 0.032 -0.051 -0.086 -0.039 -0.000 0.102 0.059 NaN 0.127 -0.012 0.133 0.085 -0.027 -0.087 -0.004 -0.007 0.032 -0.030 -0.010 -0.040 -0.055 -0.187 -0.006 -0.136 -0.088 0.042 -0.071 NaN -0.041 0.181 -0.034 0.085 0.063 -0.031 -0.034 -0.070 NaN -0.202 NaN -0.049 NaN 0.024 -0.100 0.118 NaN NaN -0.080 0.410 -0.076 0.316 NaN NaN 0.198 -0.227 0.033 0.204 -0.160 -0.133 0.106 -0.174 NaN 0.273
Patient addmited to semi-intensive unit (1=yes, 0=no) 0.016 -0.011 1.000 -0.008 -0.182 -0.177 0.007 -0.023 -0.138 -0.111 -0.023 0.138 -0.133 -0.054 -0.090 -0.051 -0.038 0.092 0.198 NaN 0.087 0.082 0.241 -0.034 -0.014 -0.127 0.022 0.085 0.158 0.027 0.059 -0.007 0.296 0.034 -0.008 -0.160 0.151 -0.026 0.183 NaN -0.104 0.136 -0.097 0.185 0.083 0.239 0.279 0.416 NaN -0.141 NaN 0.394 NaN -0.183 0.083 0.191 NaN 0.615 -0.001 0.084 0.024 NaN NaN -0.661 -0.559 0.113 -0.226 -0.179 -0.113 -0.137 -0.339 -0.091 0.185 -0.049
Patient addmited to intensive care unit (1=yes, 0=no) -0.036 -0.010 -0.008 1.000 -0.184 -0.179 0.126 -0.074 -0.121 -0.110 -0.036 0.272 -0.121 -0.090 -0.089 -0.078 -0.104 0.194 0.124 NaN 0.103 0.199 0.305 -0.037 0.066 0.016 0.132 0.159 0.241 0.142 0.248 0.019 0.182 -0.273 0.156 0.069 0.190 -0.119 0.247 NaN -0.076 -0.148 -0.081 0.316 0.201 -0.049 0.063 -0.051 NaN -0.101 NaN -0.045 NaN -0.023 0.186 0.361 NaN NaN -0.020 0.820 -0.205 NaN NaN NaN 0.352 0.298 0.204 -0.180 0.425 0.411 0.156 0.348 0.130 -0.383
Hematocrit 0.097 -0.087 -0.182 -0.184 1.000 0.968 -0.082 0.084 0.873 0.002 0.131 -0.090 0.129 0.075 0.030 0.025 0.082 -0.265 -0.133 NaN -0.017 -0.071 -0.238 0.308 0.078 0.099 -0.064 -0.150 -0.279 0.014 -0.128 0.131 -0.282 0.149 -0.210 0.083 -0.102 0.140 -0.178 NaN 0.177 0.045 0.176 -0.219 0.072 -0.252 -0.329 -0.428 NaN 0.192 NaN -0.291 NaN -0.035 -0.050 -0.303 NaN -0.476 0.073 -0.538 0.112 0.170 NaN 0.537 -0.046 -0.180 -0.196 0.064 -0.344 -0.340 0.124 0.066 0.172 0.878
Hemoglobin 0.060 -0.092 -0.177 -0.179 0.968 1.000 -0.120 0.079 0.841 -0.004 0.372 -0.102 0.116 0.185 0.019 0.028 0.095 -0.342 -0.152 NaN -0.021 -0.084 -0.230 0.305 0.050 0.063 -0.042 -0.127 -0.258 0.058 -0.101 0.178 -0.274 0.182 -0.174 0.042 -0.118 0.141 -0.183 NaN 0.161 0.079 0.162 -0.239 0.073 -0.224 -0.311 -0.396 NaN 0.179 NaN -0.280 NaN -0.016 -0.012 -0.290 NaN -0.810 0.079 -0.537 0.034 0.184 NaN 0.556 -0.035 -0.179 -0.273 0.036 -0.419 -0.421 0.081 -0.003 0.260 0.884
Platelets -0.159 -0.183 0.007 0.126 -0.082 -0.120 1.000 -0.356 -0.055 0.091 -0.159 0.443 -0.026 -0.101 0.169 -0.034 -0.201 -0.008 -0.011 NaN -0.058 -0.013 0.004 -0.183 0.204 0.038 -0.058 -0.129 -0.061 -0.058 -0.089 -0.018 0.257 0.129 0.029 0.031 0.053 -0.176 0.068 NaN -0.095 -0.187 -0.103 -0.234 0.041 0.142 -0.047 -0.095 NaN 0.061 NaN -0.238 NaN -0.161 0.104 0.083 NaN -0.450 -0.088 -0.661 -0.031 -0.477 NaN 0.295 0.083 0.539 -0.296 -0.525 0.200 0.134 -0.138 0.472 0.125 -0.483
Mean platelet volume 0.119 -0.013 -0.023 -0.074 0.084 0.079 -0.356 1.000 0.043 0.079 -0.004 -0.155 0.129 0.069 -0.047 0.078 0.038 0.045 0.063 NaN -0.081 0.093 -0.062 0.122 -0.004 0.108 -0.015 0.050 0.081 0.039 0.133 -0.050 -0.211 0.110 -0.279 -0.010 -0.079 0.183 -0.094 NaN 0.147 0.145 0.153 0.264 0.003 -0.165 0.047 0.078 NaN -0.003 NaN 0.084 NaN -0.010 0.103 -0.205 NaN -0.999 0.210 0.024 0.302 0.084 NaN 0.539 -0.351 0.090 0.162 -0.010 0.262 0.267 -0.226 0.080 -0.222 0.018
Red blood Cells -0.038 -0.053 -0.138 -0.121 0.873 0.841 -0.055 0.043 1.000 -0.010 0.090 -0.036 0.079 -0.367 -0.004 -0.459 0.045 -0.138 -0.037 NaN 0.013 -0.121 -0.165 0.206 0.042 0.060 -0.022 -0.069 -0.301 0.015 -0.138 0.140 -0.013 0.042 -0.130 -0.016 -0.028 -0.029 -0.077 NaN 0.007 -0.009 0.006 -0.245 0.013 -0.236 -0.367 -0.445 NaN 0.219 NaN -0.275 NaN -0.021 -0.087 -0.066 NaN 0.323 0.103 -0.462 -0.013 0.612 NaN 0.441 0.029 -0.351 0.040 0.260 -0.303 -0.268 0.200 -0.258 0.190 0.848
Lymphocytes -0.126 -0.095 -0.111 -0.110 0.002 -0.004 0.091 0.079 -0.010 1.000 -0.028 -0.331 0.235 0.015 0.200 0.027 0.065 -0.080 -0.182 NaN -0.935 -0.108 -0.356 -0.175 0.113 0.209 -0.105 -0.122 -0.135 -0.206 -0.242 -0.127 0.067 0.482 -0.274 0.074 -0.106 -0.105 -0.144 NaN -0.006 -0.173 -0.016 -0.243 -0.933 -0.088 -0.085 -0.039 NaN 0.196 NaN -0.024 NaN 0.167 -0.181 -0.048 NaN -0.314 -0.125 -0.529 -0.177 -0.421 NaN 0.266 0.101 0.500 -0.403 -0.545 0.083 0.013 0.058 0.227 0.109 -0.136
Mean corpuscular hemoglobin concentration (MCHC) -0.125 -0.035 -0.023 -0.036 0.131 0.372 -0.159 -0.004 0.090 -0.028 1.000 -0.066 -0.026 0.474 -0.042 0.035 0.070 -0.394 -0.117 NaN -0.022 -0.083 -0.025 0.055 -0.104 -0.145 0.081 0.072 0.032 0.175 0.091 0.205 -0.023 0.165 0.085 -0.169 -0.074 0.010 -0.033 NaN -0.053 0.147 -0.045 -0.162 0.048 0.072 -0.021 -0.017 NaN -0.042 NaN -0.063 NaN 0.071 0.149 -0.009 NaN -0.928 0.050 -0.092 -0.179 0.032 NaN 0.068 0.009 -0.090 -0.344 -0.048 -0.411 -0.427 -0.067 -0.166 0.478 0.384
Leukocytes -0.166 -0.103 0.138 0.272 -0.090 -0.102 0.443 -0.155 -0.036 -0.331 -0.066 1.000 -0.304 -0.124 -0.092 -0.103 -0.295 0.128 0.185 NaN 0.402 0.115 0.361 -0.054 0.017 -0.050 0.022 0.029 0.071 0.158 0.197 0.087 0.271 -0.221 0.123 -0.182 0.162 -0.270 0.187 NaN -0.285 -0.026 -0.287 0.163 0.354 0.088 0.081 -0.028 NaN -0.038 NaN -0.167 NaN -0.098 0.094 0.278 NaN 0.839 0.055 0.427 0.184 -0.348 NaN -0.234 0.024 0.490 -0.493 -0.563 0.043 -0.032 -0.262 0.828 0.322 -0.201
Basophils 0.108 0.032 -0.133 -0.121 0.129 0.116 -0.026 0.129 0.079 0.235 -0.026 -0.304 1.000 0.065 0.335 0.085 0.099 0.038 -0.076 NaN -0.373 -0.020 -0.224 0.082 0.170 0.116 -0.005 -0.052 -0.001 0.039 0.016 0.049 -0.131 0.023 -0.354 0.293 -0.066 0.276 -0.123 NaN 0.352 -0.035 0.345 0.255 -0.059 -0.022 0.080 -0.050 NaN 0.006 NaN -0.183 NaN -0.023 -0.064 -0.134 NaN -0.323 -0.042 -0.051 -0.293 -0.693 NaN 0.277 0.059 -0.315 0.369 0.431 0.054 0.101 0.190 -0.322 -0.214 -0.053
Mean corpuscular hemoglobin (MCH) 0.197 -0.051 -0.054 -0.090 0.075 0.185 -0.101 0.069 -0.367 0.015 0.474 -0.124 0.065 1.000 0.030 0.895 0.093 -0.300 -0.188 NaN -0.064 0.101 -0.105 0.192 0.012 -0.008 -0.042 -0.099 0.136 0.084 0.118 0.035 -0.439 0.307 -0.052 0.108 -0.138 0.309 -0.146 NaN 0.278 0.156 0.282 -0.013 0.122 0.016 0.089 0.075 NaN -0.080 NaN -0.040 NaN 0.000 0.156 -0.374 NaN -0.840 -0.037 -0.114 0.092 -0.253 NaN 0.179 -0.139 0.321 -0.607 -0.431 -0.239 -0.309 -0.210 0.422 0.277 0.130
[Correlation matrix output truncated for readability: the remaining rows (Eosinophils through ctO2 (arterial blood gas analysis)) each span 74 columns of coefficients and cannot be rendered legibly as flattened text. The pairs with the strongest coefficients are summarized in the Observations below.]

Observations:¶

  • There is a strong negative correlation (-1.00) between serum bicarbonate (HCO3, arterial blood gas analysis) and serum Phosphorus.
  • There is a strong negative correlation (-1.00) between pO2 (arterial blood gas analysis) and Phosphorus.
  • There is a relatively strong positive correlation (0.97) between Hematocrit and Hemoglobin.
  • There is a relatively strong positive correlation (0.87) between Hematocrit and ctO2 (arterial blood gas analysis).
  • A strong positive correlation (1.00) exists between ctO2 and Phosphorus, as well as between Phosphorus and Arterial FiO2.
  • There is a relatively strong positive correlation (0.99) between Total CO2 (arterial blood gas analysis) and HCO3 (arterial blood gas analysis).

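Pairs like the ones listed above can be pulled out of the correlation matrix programmatically rather than by eye. A minimal sketch, in which the `data` frame and the `strong_pairs` helper are illustrative stand-ins rather than part of this notebook; with the real data, pass the existing `corr` matrix instead:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lab-test DataFrame: "b" is built to track "a".
rng = np.random.default_rng(0)
data = pd.DataFrame({"a": rng.normal(size=50)})
data["b"] = data["a"] * 2 + rng.normal(scale=0.1, size=50)  # strongly related
data["c"] = rng.normal(size=50)                             # unrelated

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    # Keep only the upper triangle so each pair appears once, then filter
    # on absolute correlation.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().rename("corr").reset_index()
    return pairs[pairs["corr"].abs() >= threshold]

print(strong_pairs(data.corr()))
```

With the toy frame, only the engineered (a, b) pair survives the 0.8 cutoff.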
8. Heatmap¶

In [25]:
plt.figure(figsize=(15, 7))
sns.heatmap(corr, annot=True, cmap="Spectral")
plt.show()
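With this many variables, a fully annotated heatmap is hard to read. One common refinement, sketched here on a small random matrix standing in for `corr`, is to mask the redundant upper triangle, since a correlation matrix is symmetric:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small symmetric matrix standing in for the notebook's `corr`.
rng = np.random.default_rng(1)
vals = rng.uniform(-1, 1, (5, 5))
demo_corr = pd.DataFrame((vals + vals.T) / 2,
                         index=list("abcde"), columns=list("abcde"))

# Mask the upper triangle (and diagonal): that half carries no extra
# information, so hiding it halves the annotation clutter.
mask = np.triu(np.ones_like(demo_corr, dtype=bool))
plt.figure(figsize=(15, 7))
sns.heatmap(demo_corr, mask=mask, annot=True, fmt=".2f", cmap="Spectral")
plt.savefig("masked_heatmap.png")
plt.close()
```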

Insights¶

  1. This dataset requires some data cleaning to enable in-depth analysis.
  2. Age is a significant factor in the severity of Covid-19 presentation: the older a patient is, the more likely they are to be admitted to the ICU.
  3. Covid-19-positive patients have a relatively lower pCO2.
  4. There is no significant difference in the Urea or INR of Covid-19-positive and -negative patients.
  5. There is a strong negative correlation (-1.00) between serum bicarbonate (HCO3, arterial blood gas analysis) and serum Phosphorus, and between pO2 (arterial blood gas analysis) and Phosphorus.
  6. A strong positive correlation (1.00) exists between ctO2 and Phosphorus, as well as between Phosphorus and Arterial FiO2.

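A caveat on the perfect ±1.00 coefficients noted above: with heavily missing lab data, pandas computes each pairwise correlation from only the rows where both tests were taken, and any pair that overlaps in just two rows is ±1.00 by construction. A minimal sketch on a toy frame (illustrative only, not this notebook's `df`) of guarding against this with `corr`'s `min_periods` argument:

```python
import numpy as np
import pandas as pd

# Toy frame: two lab columns that overlap in just 2 rows.
toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, np.nan],
    "y": [3.0, 1.0, np.nan, np.nan],
})
# Two overlapping points always give a correlation of +/-1 (up to rounding).
print(toy.corr().loc["x", "y"])
# min_periods makes corr() return NaN for pairs with too few overlapping
# observations instead of a spurious +/-1.00.
print(toy.corr(min_periods=3).loc["x", "y"])
```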
Recommendations¶

  1. This dataset holds a wealth of information concerning Covid-19 testing and predictive factors.

  2. Data cleaning is recommended to facilitate a comprehensive analysis that can produce business intelligence and predictive modeling insights.

End of Milestone 1¶

Milestone 2¶

Data Pre-processing¶

1. Renaming Columns¶

In [26]:
# Standardize column names: drop brackets, replace punctuation and spaces
# with underscores. regex=False treats each pattern as a literal string;
# otherwise '(' would be parsed as an (invalid) regular expression.
df.columns = (
    df.columns.str.strip()
    .str.replace("(", "", regex=False)   # remove opening brackets
    .str.replace(")", "", regex=False)   # remove closing brackets
    .str.replace("-", "_", regex=False)  # replace hyphens with underscores
    .str.replace(",", "_", regex=False)  # replace commas with underscores
    .str.replace("/", "_", regex=False)  # replace forward slashes with underscores
    .str.replace(" ", "_", regex=False)  # replace spaces with underscores
)
df
Out[26]:
Patient_ID Patient_age_quantile SARS_Cov_2_exam_result Patient_addmited_to_regular_ward_1=yes__0=no Patient_addmited_to_semi_intensive_unit_1=yes__0=no Patient_addmited_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Platelets Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Basophils Mean_corpuscular_hemoglobin_MCH Eosinophils Mean_corpuscular_volume_MCV Monocytes Red_blood_cell_distribution_width_RDW Serum_Glucose Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Mycoplasma_pneumoniae Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Neutrophils Urea Proteina_C_reativa_mg_dL Creatinine Potassium Sodium Influenza_B__rapid_test Influenza_A__rapid_test Alanine_transaminase Aspartate_transaminase Gamma_glutamyltransferase Total_Bilirubin Direct_Bilirubin Indirect_Bilirubin Alkaline_phosphatase Ionized_calcium Strepto_A Magnesium pCO2_venous_blood_gas_analysis Hb_saturation_venous_blood_gas_analysis Base_excess_venous_blood_gas_analysis pO2_venous_blood_gas_analysis Fio2_venous_blood_gas_analysis Total_CO2_venous_blood_gas_analysis pH_venous_blood_gas_analysis HCO3_venous_blood_gas_analysis Rods_# Segmented Promyelocytes Metamyelocytes Myelocytes Myeloblasts Urine___Esterase Urine___Aspect Urine___pH Urine___Hemoglobin Urine___Bile_pigments Urine___Ketone_Bodies Urine___Nitrite Urine___Density Urine___Urobilinogen Urine___Protein Urine___Sugar Urine___Leukocytes Urine___Crystals Urine___Red_blood_cells Urine___Hyaline_cylinders Urine___Granular_cylinders Urine___Yeasts Urine___Color Partial_thromboplastin_time PTT Relationship_Patient_Normal International_normalized_ratio_INR Lactic_Dehydrogenase Prothrombin_time_PT__Activity Vitamin_B12 Creatine_phosphokinase CPK Ferritin Arterial_Lactic_Acid Lipase_dosage 
D_Dimer Albumin Hb_saturation_arterial_blood_gases pCO2_arterial_blood_gas_analysis Base_excess_arterial_blood_gas_analysis pH_arterial_blood_gas_analysis Total_CO2_arterial_blood_gas_analysis HCO3_arterial_blood_gas_analysis pO2_arterial_blood_gas_analysis Arteiral_Fio2 Phosphor ctO2_arterial_blood_gas_analysis
0 44477f75e8169d2 13 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 126e9dd13932f68 17 negative 0 0 0 0.237 -0.022 -0.517 0.011 0.102 0.318 -0.951 -0.095 -0.224 -0.292 1.482 0.166 0.358 -0.625 -0.141 not_detected not_detected not_detected not_detected not_detected detected NaN not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected -0.619 1.198 -0.148 2.090 -0.306 0.863 negative negative NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 a46b4402a0e5696 8 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 f7d619a94f97c45 5 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 d9e41465789c2b5 15 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN not_detected not_detected not_detected not_detected not_detected detected NaN not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5639 ae66feb9e4dc3a0 3 positive 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5640 517c2834024f3ea 17 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5641 5c57d6037fe266d 4 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5642 c20c44766f28291 10 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN clear 5 absent absent absent NaN -0.339 normal absent NaN 29000 Ausentes -0.177 absent absent absent yellow NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5643 2697fdccbfeb7f7 19 positive 0 0 0 0.694 0.542 -0.907 -0.326 0.578 -0.296 -0.353 -1.288 -1.140 -0.135 -0.836 0.026 0.568 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.381 0.454 -0.504 -0.736 -0.553 -0.934 NaN NaN -0.284 0.109 -0.420 -0.481 -0.586 -0.279 -0.243 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.420 NaN NaN -0.343 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5644 rows × 111 columns

In [27]:
#Editing column names: correcting spelling errors
# Changing "addmited" to "admitted"
df.rename(
    columns={
        'Patient_addmited_to_regular_ward_1=yes__0=no': 'Patient_admitted_to_regular_ward_1=yes__0=no',
        'Patient_addmited_to_semi_intensive_unit_1=yes__0=no': 'Patient_admitted_to_semi_intensive_unit_1=yes__0=no',
        'Patient_addmited_to_intensive_care_unit_1=yes__0=no': 'Patient_admitted_to_intensive_care_unit_1=yes__0=no',
    },
    inplace=True,
)
df
Out[27]:
Patient_ID Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Platelets Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Basophils Mean_corpuscular_hemoglobin_MCH Eosinophils Mean_corpuscular_volume_MCV Monocytes Red_blood_cell_distribution_width_RDW Serum_Glucose Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Mycoplasma_pneumoniae Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Neutrophils Urea Proteina_C_reativa_mg_dL Creatinine Potassium Sodium Influenza_B__rapid_test Influenza_A__rapid_test Alanine_transaminase Aspartate_transaminase Gamma_glutamyltransferase Total_Bilirubin Direct_Bilirubin Indirect_Bilirubin Alkaline_phosphatase Ionized_calcium Strepto_A Magnesium pCO2_venous_blood_gas_analysis Hb_saturation_venous_blood_gas_analysis Base_excess_venous_blood_gas_analysis pO2_venous_blood_gas_analysis Fio2_venous_blood_gas_analysis Total_CO2_venous_blood_gas_analysis pH_venous_blood_gas_analysis HCO3_venous_blood_gas_analysis Rods_# Segmented Promyelocytes Metamyelocytes Myelocytes Myeloblasts Urine___Esterase Urine___Aspect Urine___pH Urine___Hemoglobin Urine___Bile_pigments Urine___Ketone_Bodies Urine___Nitrite Urine___Density Urine___Urobilinogen Urine___Protein Urine___Sugar Urine___Leukocytes Urine___Crystals Urine___Red_blood_cells Urine___Hyaline_cylinders Urine___Granular_cylinders Urine___Yeasts Urine___Color Partial_thromboplastin_time PTT Relationship_Patient_Normal International_normalized_ratio_INR Lactic_Dehydrogenase Prothrombin_time_PT__Activity Vitamin_B12 Creatine_phosphokinase CPK Ferritin Arterial_Lactic_Acid Lipase_dosage 
D_Dimer Albumin Hb_saturation_arterial_blood_gases pCO2_arterial_blood_gas_analysis Base_excess_arterial_blood_gas_analysis pH_arterial_blood_gas_analysis Total_CO2_arterial_blood_gas_analysis HCO3_arterial_blood_gas_analysis pO2_arterial_blood_gas_analysis Arteiral_Fio2 Phosphor ctO2_arterial_blood_gas_analysis
0 44477f75e8169d2 13 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 126e9dd13932f68 17 negative 0 0 0 0.237 -0.022 -0.517 0.011 0.102 0.318 -0.951 -0.095 -0.224 -0.292 1.482 0.166 0.358 -0.625 -0.141 not_detected not_detected not_detected not_detected not_detected detected NaN not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected -0.619 1.198 -0.148 2.090 -0.306 0.863 negative negative NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 a46b4402a0e5696 8 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 f7d619a94f97c45 5 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 d9e41465789c2b5 15 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN not_detected not_detected not_detected not_detected not_detected detected NaN not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5639 ae66feb9e4dc3a0 3 positive 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5640 517c2834024f3ea 17 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5641 5c57d6037fe266d 4 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5642 c20c44766f28291 10 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN clear 5 absent absent absent NaN -0.339 normal absent NaN 29000 Ausentes -0.177 absent absent absent yellow NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5643 2697fdccbfeb7f7 19 positive 0 0 0 0.694 0.542 -0.907 -0.326 0.578 -0.296 -0.353 -1.288 -1.140 -0.135 -0.836 0.026 0.568 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.381 0.454 -0.504 -0.736 -0.553 -0.934 NaN NaN -0.284 0.109 -0.420 -0.481 -0.586 -0.279 -0.243 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.420 NaN NaN -0.343 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5644 rows × 111 columns

Comments:¶

  • Column names have been standardized, making the dataset more uniform and readable.
  • Spelling errors ("addmited") have been corrected.
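The explicit rename dictionary above works, but the same fix can be applied in one pass with a vectorized string replace over all column names. A minimal sketch on a toy frame that reproduces the misspelled pattern (the data itself is hypothetical):

```python
import pandas as pd

# Toy frame reproducing the misspelled column names (hypothetical data)
df_demo = pd.DataFrame(columns=[
    "Patient_addmited_to_regular_ward_1=yes__0=no",
    "Patient_addmited_to_semi_intensive_unit_1=yes__0=no",
    "Patient_addmited_to_intensive_care_unit_1=yes__0=no",
    "SARS_Cov_2_exam_result",
])

# One vectorized replace fixes every occurrence of the misspelling
df_demo.columns = df_demo.columns.str.replace("addmited", "admitted", regex=False)

print(df_demo.columns.tolist())
```

This scales better if more columns share the same typo, at the cost of being less explicit than a rename dictionary.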

2. Removal of Unwanted Variables¶

  1. Patient_ID is dropped first since it only serves as an identifier and has no bearing on Covid-19 results. It is therefore not needed in this analysis.
  2. Other columns will be dropped based on an assessment of the extent of their missing values.
  3. Columns that contain missing values but are not dropped will have those values treated.
In [28]:
# Dropping Patient_ID in place. Note: with inplace=True, drop() returns None,
# so the result must not be assigned to a new variable.
df.drop('Patient_ID', axis=1, inplace=True)
In [29]:
#Deciding which columns to drop based on missing values
#Percentage of missing values per column (using len(df) rather than the hardcoded row count)
df_null_1 = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False).head(60)
df_null_1
Out[29]:
Mycoplasma_pneumoniae                     100.000
Urine___Sugar                             100.000
Partial_thromboplastin_time PTT           100.000
Prothrombin_time_PT__Activity             100.000
D_Dimer                                   100.000
Fio2_venous_blood_gas_analysis             99.982
Urine___Nitrite                            99.982
Vitamin_B12                                99.947
Lipase_dosage                              99.858
Albumin                                    99.770
Arteiral_Fio2                              99.646
Phosphor                                   99.646
Ferritin                                   99.592
Hb_saturation_arterial_blood_gases         99.522
pCO2_arterial_blood_gas_analysis           99.522
Base_excess_arterial_blood_gas_analysis    99.522
pH_arterial_blood_gas_analysis             99.522
Arterial_Lactic_Acid                       99.522
Total_CO2_arterial_blood_gas_analysis      99.522
HCO3_arterial_blood_gas_analysis           99.522
pO2_arterial_blood_gas_analysis            99.522
ctO2_arterial_blood_gas_analysis           99.522
Magnesium                                  99.291
Ionized_calcium                            99.114
Urine___Ketone_Bodies                      98.990
Urine___Protein                            98.937
Urine___Esterase                           98.937
Urine___Hyaline_cylinders                  98.813
Urine___Granular_cylinders                 98.777
Urine___Urobilinogen                       98.777
Urine___pH                                 98.760
Urine___Hemoglobin                         98.760
Urine___Bile_pigments                      98.760
Urine___Color                              98.760
Urine___Density                            98.760
Urine___Leukocytes                         98.760
Urine___Crystals                           98.760
Urine___Red_blood_cells                    98.760
Urine___Yeasts                             98.760
Urine___Aspect                             98.760
Relationship_Patient_Normal                98.388
Myeloblasts                                98.281
Myelocytes                                 98.281
Metamyelocytes                             98.281
Segmented                                  98.281
Rods_#                                     98.281
Promyelocytes                              98.281
Lactic_Dehydrogenase                       98.210
Creatine_phosphokinase CPK                 98.157
International_normalized_ratio_INR         97.644
pCO2_venous_blood_gas_analysis             97.590
Base_excess_venous_blood_gas_analysis      97.590
HCO3_venous_blood_gas_analysis             97.590
pH_venous_blood_gas_analysis               97.590
Total_CO2_venous_blood_gas_analysis        97.590
pO2_venous_blood_gas_analysis              97.590
Hb_saturation_venous_blood_gas_analysis    97.590
Alkaline_phosphatase                       97.449
Gamma_glutamyltransferase                  97.289
Indirect_Bilirubin                         96.775
dtype: float64

Observations:¶

  • The columns listed above contain more missing values than useful data.
  • Four columns have 100% missing values. These are:
  1. Urine - Sugar
  2. Mycoplasma pneumoniae
  3. Partial thromboplastin time (PTT)
  4. Prothrombin time (PT), Activity
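Fully empty columns can also be identified directly, without eyeballing the sorted percentages. A minimal sketch on a hypothetical mini-frame (column names borrowed from the dataset, values invented):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame: two columns entirely NaN, one partially filled
demo = pd.DataFrame({
    "Urine___Sugar": [np.nan, np.nan, np.nan],
    "D_Dimer": [np.nan, np.nan, np.nan],
    "Hemoglobin": [0.04, np.nan, -0.02],
})

# Columns where every value is missing (100% NaN)
fully_empty = demo.columns[demo.isnull().all()].tolist()
print(fully_empty)  # ['Urine___Sugar', 'D_Dimer']
```

`isnull().all()` is True only for columns with no observed values at all, so this flags exactly the 100%-missing group.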
In [30]:
#Code to determine columns with the fewest missing values
df_null_2 = (df.isnull().sum() / len(df) * 100).sort_values(ascending=True).head(51)
df_null_2
Out[30]:
Patient_age_quantile                                   0.000
SARS_Cov_2_exam_result                                 0.000
Patient_admitted_to_regular_ward_1=yes__0=no           0.000
Patient_admitted_to_semi_intensive_unit_1=yes__0=no    0.000
Patient_admitted_to_intensive_care_unit_1=yes__0=no    0.000
Influenza_B                                           76.010
Respiratory_Syncytial_Virus                           76.010
Influenza_A                                           76.010
Rhinovirus_Enterovirus                                76.045
Inf_A_H1N1_2009                                       76.045
CoronavirusOC43                                       76.045
Coronavirus229E                                       76.045
Parainfluenza_4                                       76.045
Adenovirus                                            76.045
Chlamydophila_pneumoniae                              76.045
Parainfluenza_3                                       76.045
Coronavirus_HKU1                                      76.045
CoronavirusNL63                                       76.045
Parainfluenza_1                                       76.045
Bordetella_pertussis                                  76.045
Parainfluenza_2                                       76.045
Metapneumovirus                                       76.045
Influenza_A__rapid_test                               85.471
Influenza_B__rapid_test                               85.471
Hemoglobin                                            89.316
Hematocrit                                            89.316
Red_blood_cell_distribution_width_RDW                 89.334
Platelets                                             89.334
Mean_corpuscular_volume_MCV                           89.334
Eosinophils                                           89.334
Mean_corpuscular_hemoglobin_MCH                       89.334
Basophils                                             89.334
Leukocytes                                            89.334
Mean_corpuscular_hemoglobin_concentration MCHC        89.334
Lymphocytes                                           89.334
Red_blood_Cells                                       89.334
Monocytes                                             89.352
Mean_platelet_volume                                  89.387
Neutrophils                                           90.911
Proteina_C_reativa_mg_dL                              91.035
Creatinine                                            92.488
Urea                                                  92.966
Potassium                                             93.427
Sodium                                                93.444
Strepto_A                                             94.118
Aspartate_transaminase                                95.996
Alanine_transaminase                                  96.013
Serum_Glucose                                         96.315
Total_Bilirubin                                       96.775
Direct_Bilirubin                                      96.775
Indirect_Bilirubin                                    96.775
dtype: float64

Observation¶

  • The columns with no missing data are:
  1. Patient age quantile
  2. SARS-Cov-2 exam result
  3. Patient admitted to regular ward (1=yes, 0=no)
  4. Patient admitted to semi-intensive unit (1=yes, 0=no)
  5. Patient admitted to intensive care unit (1=yes, 0=no)
In [31]:
#Dropping columns with more than 90% missing values
df_clean1=df.loc[:,df.isnull().mean()<0.90]
df_clean1
Out[31]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Platelets Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Basophils Mean_corpuscular_hemoglobin_MCH Eosinophils Mean_corpuscular_volume_MCV Monocytes Red_blood_cell_distribution_width_RDW Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Influenza_B__rapid_test Influenza_A__rapid_test
0 13 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 17 negative 0 0 0 0.237 -0.022 -0.517 0.011 0.102 0.318 -0.951 -0.095 -0.224 -0.292 1.482 0.166 0.358 -0.625 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected negative negative
2 8 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 5 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 15 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5639 3 positive 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5640 17 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5641 4 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5642 10 negative 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5643 19 positive 0 0 0 0.694 0.542 -0.907 -0.326 0.578 -0.296 -0.353 -1.288 -1.140 -0.135 -0.836 0.026 0.568 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5644 rows × 38 columns

Comments:¶

  • Patient_ID has been dropped
  • 72 other columns/variables have also been dropped on account of total or near-total (more than 90%) missing values.
  • A total of 73 columns/variables out of 111 have been removed.
  • 38 columns now remain.

3. Missing value treatment¶

  1. Numerical columns will have their missing values replaced by their respective medians.
  2. Missing categorical data will be replaced with the label 'Unknown'.
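The two-step plan above can be sketched end-to-end as a single dtype-aware pass. A minimal illustration on a toy frame (column names from the dataset, values hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame mixing a numeric lab value and a categorical test result
demo = pd.DataFrame({
    "Hemoglobin": [0.5, np.nan, -0.3, np.nan],
    "Influenza_A": ["detected", np.nan, "not_detected", np.nan],
})

# Numeric columns: impute with the column median
num_cols = demo.select_dtypes(include=np.number).columns
demo[num_cols] = demo[num_cols].fillna(demo[num_cols].median())

# Categorical (object) columns: flag missingness explicitly rather than guessing a class
cat_cols = demo.select_dtypes(include="object").columns
demo[cat_cols] = demo[cat_cols].fillna("Unknown")

print(demo.isnull().sum().sum())  # 0
```

Filling categoricals with 'Unknown' preserves the information that the test was never run, which a mode fill would silently erase.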

A) Treating missing numerical data¶

In [32]:
#Code to fill in missing values of numeric data with a median
numeric_columns = df_clean1.select_dtypes(include=np.number).columns
df_clean1[numeric_columns]=df_clean1[numeric_columns].fillna(df_clean1[numeric_columns].median())
df_clean1
Out[32]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Platelets Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Basophils Mean_corpuscular_hemoglobin_MCH Eosinophils Mean_corpuscular_volume_MCV Monocytes Red_blood_cell_distribution_width_RDW Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Influenza_B__rapid_test Influenza_A__rapid_test
0 13 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 17 negative 0 0 0 0.237 -0.022 -0.517 0.011 0.102 0.318 -0.951 -0.095 -0.224 -0.292 1.482 0.166 0.358 -0.625 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected negative negative
2 8 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 5 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 15 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5639 3 positive 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5640 17 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5641 4 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5642 10 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5643 19 positive 0 0 0 0.694 0.542 -0.907 -0.326 0.578 -0.296 -0.353 -1.288 -1.140 -0.135 -0.836 0.026 0.568 -0.183 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5644 rows × 38 columns

B) Treating missing categorical data¶

In [33]:
# filling with Unknown class
cat_cols= ['Respiratory_Syncytial_Virus','Influenza_A','Influenza_B','Parainfluenza_1',
           'CoronavirusNL63','Rhinovirus_Enterovirus','Coronavirus_HKU1','Parainfluenza_3',
           'Chlamydophila_pneumoniae','Adenovirus','Parainfluenza_4','Coronavirus229E', 
           'CoronavirusOC43', 'Inf_A_H1N1_2009', 'Bordetella_pertussis','Metapneumovirus',
           'Parainfluenza_2','Influenza_B__rapid_test','Influenza_A__rapid_test'
          ]
df_clean1[cat_cols] = df_clean1[cat_cols].fillna("Unknown")
df_clean1
Out[33]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Platelets Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Basophils Mean_corpuscular_hemoglobin_MCH Eosinophils Mean_corpuscular_volume_MCV Monocytes Red_blood_cell_distribution_width_RDW Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Influenza_B__rapid_test Influenza_A__rapid_test
0 13 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1 17 negative 0 0 0 0.237 -0.022 -0.517 0.011 0.102 0.318 -0.951 -0.095 -0.224 -0.292 1.482 0.166 0.358 -0.625 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected negative negative
2 8 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
3 5 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
4 15 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected Unknown Unknown
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5639 3 positive 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
5640 17 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
5641 4 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
5642 10 negative 0 0 0 0.053 0.040 -0.122 -0.102 0.014 -0.014 -0.055 -0.213 -0.224 0.126 -0.330 0.066 -0.115 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
5643 19 positive 0 0 0 0.694 0.542 -0.907 -0.326 0.578 -0.296 -0.353 -1.288 -1.140 -0.135 -0.836 0.026 0.568 -0.183 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown

5644 rows × 38 columns

In [34]:
df_clean1.isna().sum()
Out[34]:
Patient_age_quantile                                   0
SARS_Cov_2_exam_result                                 0
Patient_admitted_to_regular_ward_1=yes__0=no           0
Patient_admitted_to_semi_intensive_unit_1=yes__0=no    0
Patient_admitted_to_intensive_care_unit_1=yes__0=no    0
Hematocrit                                             0
Hemoglobin                                             0
Platelets                                              0
Mean_platelet_volume                                   0
Red_blood_Cells                                        0
Lymphocytes                                            0
Mean_corpuscular_hemoglobin_concentration MCHC         0
Leukocytes                                             0
Basophils                                              0
Mean_corpuscular_hemoglobin_MCH                        0
Eosinophils                                            0
Mean_corpuscular_volume_MCV                            0
Monocytes                                              0
Red_blood_cell_distribution_width_RDW                  0
Respiratory_Syncytial_Virus                            0
Influenza_A                                            0
Influenza_B                                            0
Parainfluenza_1                                        0
CoronavirusNL63                                        0
Rhinovirus_Enterovirus                                 0
Coronavirus_HKU1                                       0
Parainfluenza_3                                        0
Chlamydophila_pneumoniae                               0
Adenovirus                                             0
Parainfluenza_4                                        0
Coronavirus229E                                        0
CoronavirusOC43                                        0
Inf_A_H1N1_2009                                        0
Bordetella_pertussis                                   0
Metapneumovirus                                        0
Parainfluenza_2                                        0
Influenza_B__rapid_test                                0
Influenza_A__rapid_test                                0
dtype: int64

Observations:¶

  • All missing values have been treated.
  • Numerical variables were imputed with their median.
  • Categorical variables were imputed with their mode.
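The imputation strategy described above can be sketched on a small toy frame (the data and values here are hypothetical; the notebook applies the same idea to `df_clean1`):

```python
import pandas as pd

# toy frame with one numerical and one categorical column (hypothetical data)
df = pd.DataFrame({
    "Leukocytes": [0.1, None, 0.3, None, 0.2],
    "Influenza_A": ["not_detected", None, "detected", "not_detected", None],
})

# numerical column: impute missing values with the median
df["Leukocytes"] = df["Leukocytes"].fillna(df["Leukocytes"].median())

# categorical column: impute missing values with the mode (most frequent value)
df["Influenza_A"] = df["Influenza_A"].fillna(df["Influenza_A"].mode()[0])

print(df.isna().sum().sum())  # 0 missing values remain
```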

4. Outlier Treatment¶

In [35]:
# outlier detection using boxplot
plt.figure(figsize=(30, 100))

for i, variable in enumerate(numeric_columns):
    plt.subplot(20, 3, i + 1)
    plt.boxplot(df_clean1[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observation:¶

  • Columns with distinct outliers are:
  1. Platelets
  2. Basophils
  3. Eosinophils
  4. Monocytes
  5. Red_blood_cell_distribution_width_RDW
  • These columns will have their outliers treated.
  • The remaining numerical columns either:
  1. Do not have distinct outliers, or
  2. Have outliers that are considered significant and thus will not be treated.
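The visual judgement from the boxplots can be cross-checked numerically with the same 1.5 × IQR whisker rule the plots use. A minimal sketch on synthetic data (not the project dataset):

```python
import pandas as pd

def count_outliers(s: pd.Series) -> int:
    """Count values outside the 1.5*IQR whiskers used by the boxplots."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

s = pd.Series([1, 2, 2, 3, 100])  # 100 lies far outside the whiskers
print(count_outliers(s))  # 1
```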
In [36]:
# functions to treat outliers by flooring and capping


def treat_outliers(df, col):
    """
    Treats outliers in a variable

    df: dataframe
    col: dataframe column
    """
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR

    # all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)

    return df


def treat_outliers_all(df, col_list):
    """
    Treat outliers in a list of variables

    df: dataframe
    col_list: list of dataframe columns
    """
    for c in col_list:
        df = treat_outliers(df, c)

    return df
In [37]:
treat_out_cols = [
    "Platelets",
    "Basophils",
    "Eosinophils",
    "Monocytes",
    "Red_blood_cell_distribution_width_RDW",
]

df_clean2 = treat_outliers_all(df_clean1, treat_out_cols)
In [38]:
# Checking to see result of outlier treatment

plt.figure(figsize=(9, 15))

for i, variable in enumerate(treat_out_cols):
    plt.subplot(3, 4, i + 1)
    plt.boxplot(df_clean2[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observation:¶

The outliers in the selected columns have been floored to the lower whisker and capped at the upper whisker.

5. Dropping variables on account of number of unique values¶

In [39]:
df_clean2.nunique()
Out[39]:
Patient_age_quantile                                    20
SARS_Cov_2_exam_result                                   2
Patient_admitted_to_regular_ward_1=yes__0=no             2
Patient_admitted_to_semi_intensive_unit_1=yes__0=no      2
Patient_admitted_to_intensive_care_unit_1=yes__0=no      2
Hematocrit                                             176
Hemoglobin                                              84
Platelets                                                1
Mean_platelet_volume                                    48
Red_blood_Cells                                        211
Lymphocytes                                            318
Mean_corpuscular_hemoglobin_concentration MCHC          57
Leukocytes                                             476
Basophils                                                1
Mean_corpuscular_hemoglobin_MCH                         91
Eosinophils                                              1
Mean_corpuscular_volume_MCV                            190
Monocytes                                                1
Red_blood_cell_distribution_width_RDW                    1
Respiratory_Syncytial_Virus                              3
Influenza_A                                              3
Influenza_B                                              3
Parainfluenza_1                                          3
CoronavirusNL63                                          3
Rhinovirus_Enterovirus                                   3
Coronavirus_HKU1                                         3
Parainfluenza_3                                          3
Chlamydophila_pneumoniae                                 3
Adenovirus                                               3
Parainfluenza_4                                          3
Coronavirus229E                                          3
CoronavirusOC43                                          3
Inf_A_H1N1_2009                                          3
Bordetella_pertussis                                     3
Metapneumovirus                                          3
Parainfluenza_2                                          2
Influenza_B__rapid_test                                  3
Influenza_A__rapid_test                                  3
dtype: int64

Observation:¶

Five columns consist of only one value. These columns are:

  • Platelets
  • Red_blood_cell_distribution_width_RDW
  • Monocytes
  • Basophils
  • Eosinophils

A constant column carries no information for modeling, so these columns will be dropped. They became constant as a side effect of the pipeline: median imputation made the median the dominant value in each of these columns, so Q1 = Q3 and IQR = 0, and flooring and capping then clipped every remaining value to that single constant.
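Rather than listing the constant columns by hand, they can be found programmatically. A sketch with a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"Platelets": [0.0, 0.0, 0.0], "Hemoglobin": [0.1, 0.2, 0.3]})

# columns with a single unique value carry no information for modeling
constant_cols = df.columns[df.nunique() == 1].tolist()
df = df.drop(columns=constant_cols)

print(constant_cols)  # ['Platelets']
```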

In [40]:
df_clean3 = df_clean2.drop(
    ["Red_blood_cell_distribution_width_RDW", "Monocytes", "Basophils", "Eosinophils", "Platelets"], axis=1
)
df_clean3.head()
Out[40]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_hemoglobin_MCH Mean_corpuscular_volume_MCV Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Influenza_B__rapid_test Influenza_A__rapid_test
0 13 negative 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1 17 negative 0 0 0 0.237 -0.022 0.011 0.102 0.318 -0.951 -0.095 -0.292 0.166 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected negative negative
2 8 negative 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
3 5 negative 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
4 15 negative 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected Unknown Unknown

Observation:¶

  • The 5 columns containing only one value have been dropped.
  • There are now 33 variables in the dataset

6. Variable transformation¶

A) Replace values in SARS_Cov_2_exam_result column with binary values (0/1)¶

In [41]:
# Replace entries in 'SARS_Cov_2_exam_result' with zeros and ones
replaceStruct = {"SARS_Cov_2_exam_result": {"negative": 0, "positive": 1}}
df_model = df_clean3.replace(replaceStruct)
df_model.head()
Out[41]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_hemoglobin_MCH Mean_corpuscular_volume_MCV Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Influenza_B__rapid_test Influenza_A__rapid_test
0 13 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1 17 0 0 0 0 0.237 -0.022 0.011 0.102 0.318 -0.951 -0.095 -0.292 0.166 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected negative negative
2 8 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
3 5 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
4 15 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected Unknown Unknown

Observation:¶

Values in the 'SARS_Cov_2_exam_result' column have been replaced with zeros (negative) and ones (positive).

B) Changing object data types to categorical¶

In [42]:
# convert the object columns listed in cat_cols (defined earlier) to the category dtype
df_model[cat_cols] = df_model[cat_cols].astype("category")

Comment¶

Object data types have been transformed into categorical
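The main benefit of this conversion is memory: a category column stores small integer codes plus a single copy of each label, instead of one Python string per row. A minimal sketch:

```python
import pandas as pd

# 3000 rows but only 3 distinct labels, like the viral-test columns
s = pd.Series(["detected", "not_detected", "Unknown"] * 1000)
cat = s.astype("category")

obj_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(cat_bytes < obj_bytes)  # True: the category version is smaller
```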

Exploratory Data Analysis¶

Aim: To evaluate dataset after manipulation¶

In [43]:
df_model.head()
Out[43]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_hemoglobin_MCH Mean_corpuscular_volume_MCV Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Influenza_B__rapid_test Influenza_A__rapid_test
0 13 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1 17 0 0 0 0 0.237 -0.022 0.011 0.102 0.318 -0.951 -0.095 -0.292 0.166 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected negative negative
2 8 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
3 5 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
4 15 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 not_detected not_detected not_detected not_detected not_detected detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected not_detected Unknown Unknown

Comment:¶

Column names have been revised from those in the original dataset.

In [44]:
# viewing a random sample of the dataset
df_model.sample(n=10, random_state=1)
Out[44]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_hemoglobin_MCH Mean_corpuscular_volume_MCV Respiratory_Syncytial_Virus Influenza_A Influenza_B Parainfluenza_1 CoronavirusNL63 Rhinovirus_Enterovirus Coronavirus_HKU1 Parainfluenza_3 Chlamydophila_pneumoniae Adenovirus Parainfluenza_4 Coronavirus229E CoronavirusOC43 Inf_A_H1N1_2009 Bordetella_pertussis Metapneumovirus Parainfluenza_2 Influenza_B__rapid_test Influenza_A__rapid_test
4441 12 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1603 1 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1206 10 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1586 6 1 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
2730 16 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
3205 9 0 0 0 0 0.191 0.228 -0.438 0.031 1.461 0.244 0.573 0.283 0.226 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown negative negative
5321 10 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
943 17 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
5029 10 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
1998 1 0 0 0 0 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown

Comment¶

A sample of 10 random cases to evaluate changes

In [45]:
df_model.shape
Out[45]:
(5644, 33)

Observation:¶

There are 33 columns and 5644 rows

In [46]:
df_model.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5644 entries, 0 to 5643
Data columns (total 33 columns):
 #   Column                                               Non-Null Count  Dtype   
---  ------                                               --------------  -----   
 0   Patient_age_quantile                                 5644 non-null   int64   
 1   SARS_Cov_2_exam_result                               5644 non-null   int64   
 2   Patient_admitted_to_regular_ward_1=yes__0=no         5644 non-null   int64   
 3   Patient_admitted_to_semi_intensive_unit_1=yes__0=no  5644 non-null   int64   
 4   Patient_admitted_to_intensive_care_unit_1=yes__0=no  5644 non-null   int64   
 5   Hematocrit                                           5644 non-null   float64 
 6   Hemoglobin                                           5644 non-null   float64 
 7   Mean_platelet_volume                                 5644 non-null   float64 
 8   Red_blood_Cells                                      5644 non-null   float64 
 9   Lymphocytes                                          5644 non-null   float64 
 10  Mean_corpuscular_hemoglobin_concentration MCHC       5644 non-null   float64 
 11  Leukocytes                                           5644 non-null   float64 
 12  Mean_corpuscular_hemoglobin_MCH                      5644 non-null   float64 
 13  Mean_corpuscular_volume_MCV                          5644 non-null   float64 
 14  Respiratory_Syncytial_Virus                          5644 non-null   category
 15  Influenza_A                                          5644 non-null   category
 16  Influenza_B                                          5644 non-null   category
 17  Parainfluenza_1                                      5644 non-null   category
 18  CoronavirusNL63                                      5644 non-null   category
 19  Rhinovirus_Enterovirus                               5644 non-null   category
 20  Coronavirus_HKU1                                     5644 non-null   category
 21  Parainfluenza_3                                      5644 non-null   category
 22  Chlamydophila_pneumoniae                             5644 non-null   category
 23  Adenovirus                                           5644 non-null   category
 24  Parainfluenza_4                                      5644 non-null   category
 25  Coronavirus229E                                      5644 non-null   category
 26  CoronavirusOC43                                      5644 non-null   category
 27  Inf_A_H1N1_2009                                      5644 non-null   category
 28  Bordetella_pertussis                                 5644 non-null   category
 29  Metapneumovirus                                      5644 non-null   category
 30  Parainfluenza_2                                      5644 non-null   category
 31  Influenza_B__rapid_test                              5644 non-null   category
 32  Influenza_A__rapid_test                              5644 non-null   category
dtypes: category(19), float64(9), int64(5)
memory usage: 724.6 KB
In [47]:
df_model.nunique()
Out[47]:
Patient_age_quantile                                    20
SARS_Cov_2_exam_result                                   2
Patient_admitted_to_regular_ward_1=yes__0=no             2
Patient_admitted_to_semi_intensive_unit_1=yes__0=no      2
Patient_admitted_to_intensive_care_unit_1=yes__0=no      2
Hematocrit                                             176
Hemoglobin                                              84
Mean_platelet_volume                                    48
Red_blood_Cells                                        211
Lymphocytes                                            318
Mean_corpuscular_hemoglobin_concentration MCHC          57
Leukocytes                                             476
Mean_corpuscular_hemoglobin_MCH                         91
Mean_corpuscular_volume_MCV                            190
Respiratory_Syncytial_Virus                              3
Influenza_A                                              3
Influenza_B                                              3
Parainfluenza_1                                          3
CoronavirusNL63                                          3
Rhinovirus_Enterovirus                                   3
Coronavirus_HKU1                                         3
Parainfluenza_3                                          3
Chlamydophila_pneumoniae                                 3
Adenovirus                                               3
Parainfluenza_4                                          3
Coronavirus229E                                          3
CoronavirusOC43                                          3
Inf_A_H1N1_2009                                          3
Bordetella_pertussis                                     3
Metapneumovirus                                          3
Parainfluenza_2                                          2
Influenza_B__rapid_test                                  3
Influenza_A__rapid_test                                  3
dtype: int64

Comment¶

Summary of unique values per column

In [48]:
df_model.isna().sum()
Out[48]:
Patient_age_quantile                                   0
SARS_Cov_2_exam_result                                 0
Patient_admitted_to_regular_ward_1=yes__0=no           0
Patient_admitted_to_semi_intensive_unit_1=yes__0=no    0
Patient_admitted_to_intensive_care_unit_1=yes__0=no    0
Hematocrit                                             0
Hemoglobin                                             0
Mean_platelet_volume                                   0
Red_blood_Cells                                        0
Lymphocytes                                            0
Mean_corpuscular_hemoglobin_concentration MCHC         0
Leukocytes                                             0
Mean_corpuscular_hemoglobin_MCH                        0
Mean_corpuscular_volume_MCV                            0
Respiratory_Syncytial_Virus                            0
Influenza_A                                            0
Influenza_B                                            0
Parainfluenza_1                                        0
CoronavirusNL63                                        0
Rhinovirus_Enterovirus                                 0
Coronavirus_HKU1                                       0
Parainfluenza_3                                        0
Chlamydophila_pneumoniae                               0
Adenovirus                                             0
Parainfluenza_4                                        0
Coronavirus229E                                        0
CoronavirusOC43                                        0
Inf_A_H1N1_2009                                        0
Bordetella_pertussis                                   0
Metapneumovirus                                        0
Parainfluenza_2                                        0
Influenza_B__rapid_test                                0
Influenza_A__rapid_test                                0
dtype: int64

Comment¶

There are no missing values

In [49]:
df_model.describe().T
Out[49]:
count mean std min 25% 50% 75% max
Patient_age_quantile 5644.000 9.318 5.778 0.000 4.000 9.000 14.000 19.000
SARS_Cov_2_exam_result 5644.000 0.099 0.299 0.000 0.000 0.000 0.000 1.000
Patient_admitted_to_regular_ward_1=yes__0=no 5644.000 0.014 0.117 0.000 0.000 0.000 0.000 1.000
Patient_admitted_to_semi_intensive_unit_1=yes__0=no 5644.000 0.009 0.094 0.000 0.000 0.000 0.000 1.000
Patient_admitted_to_intensive_care_unit_1=yes__0=no 5644.000 0.007 0.085 0.000 0.000 0.000 0.000 1.000
Hematocrit 5644.000 0.048 0.327 -4.501 0.053 0.053 0.053 2.663
Hemoglobin 5644.000 0.036 0.327 -4.346 0.040 0.040 0.040 2.672
Mean_platelet_volume 5644.000 -0.091 0.327 -2.458 -0.102 -0.102 -0.102 3.713
Red_blood_Cells 5644.000 0.012 0.327 -3.971 0.014 0.014 0.014 3.646
Lymphocytes 5644.000 -0.013 0.327 -1.865 -0.014 -0.014 -0.014 3.764
Mean_corpuscular_hemoglobin_concentration MCHC 5644.000 -0.049 0.327 -5.432 -0.055 -0.055 -0.055 3.331
Leukocytes 5644.000 -0.190 0.333 -2.020 -0.213 -0.213 -0.213 4.522
Mean_corpuscular_hemoglobin_MCH 5644.000 0.112 0.329 -5.938 0.126 0.126 0.126 4.099
Mean_corpuscular_volume_MCV 5644.000 0.059 0.327 -5.102 0.066 0.066 0.066 3.411

Comment¶

With missing values imputed and outliers treated, the summary statistics now describe every numerical variable without gaps.

In [50]:
# Code for correlation table
corr2 = df_model.corr()
corr2
Out[50]:
Patient_age_quantile SARS_Cov_2_exam_result Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_hemoglobin_MCH Mean_corpuscular_volume_MCV
Patient_age_quantile 1.000 0.075 0.046 0.016 -0.036 0.026 0.015 0.049 -0.014 -0.039 -0.034 -0.031 0.050 0.084
SARS_Cov_2_exam_result 0.075 1.000 0.142 0.019 0.028 0.035 0.038 0.044 0.045 -0.005 0.020 -0.098 -0.016 -0.024
Patient_admitted_to_regular_ward_1=yes__0=no 0.046 0.142 1.000 -0.011 -0.010 -0.084 -0.085 0.012 -0.047 -0.075 -0.016 -0.035 -0.070 -0.047
Patient_admitted_to_semi_intensive_unit_1=yes__0=no 0.016 0.019 -0.011 1.000 -0.008 -0.173 -0.166 0.001 -0.125 -0.095 -0.009 0.165 -0.075 -0.059
Patient_admitted_to_intensive_care_unit_1=yes__0=no -0.036 0.028 -0.010 -0.008 1.000 -0.160 -0.154 -0.044 -0.102 -0.088 -0.021 0.252 -0.094 -0.075
Hematocrit 0.026 0.035 -0.084 -0.173 -0.160 1.000 0.968 0.078 0.872 0.001 0.128 -0.098 0.081 0.028
Hemoglobin 0.015 0.038 -0.085 -0.166 -0.154 0.968 1.000 0.075 0.841 -0.005 0.369 -0.108 0.188 0.030
Mean_platelet_volume 0.049 0.044 0.012 0.001 -0.044 0.078 0.075 1.000 0.040 0.080 0.002 -0.131 0.056 0.070
Red_blood_Cells -0.014 0.045 -0.047 -0.125 -0.102 0.872 0.841 0.040 1.000 -0.010 0.089 -0.038 -0.363 -0.457
Lymphocytes -0.039 -0.005 -0.075 -0.095 -0.088 0.001 -0.005 0.080 -0.010 1.000 -0.027 -0.321 0.013 0.026
Mean_corpuscular_hemoglobin_concentration MCHC -0.034 0.020 -0.016 -0.009 -0.021 0.128 0.369 0.002 0.089 -0.027 1.000 -0.055 0.464 0.032
Leukocytes -0.031 -0.098 -0.035 0.165 0.252 -0.098 -0.108 -0.131 -0.038 -0.321 -0.055 1.000 -0.144 -0.113
Mean_corpuscular_hemoglobin_MCH 0.050 -0.016 -0.070 -0.075 -0.094 0.081 0.188 0.056 -0.363 0.013 0.464 -0.144 1.000 0.895
Mean_corpuscular_volume_MCV 0.084 -0.024 -0.047 -0.059 -0.075 0.028 0.030 0.070 -0.457 0.026 0.032 -0.113 0.895 1.000
In [51]:
# Code for pairplots
sns.pairplot(df_model)
plt.show()
In [52]:
# Code for heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(corr2, annot=True, cmap="Spectral")
plt.show()

Observation from Correlation table, Pairplots and Heatmap¶

There is a strong positive correlation between the following pairs of variables:

  • Hematocrit & Red blood cells (0.87)
  • Hemoglobin & Red blood cells (0.84)
  • Mean corpuscular volume & Mean corpuscular hemoglobin (0.89)
  • Hematocrit & Hemoglobin (0.97)
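Strongly correlated pairs can also be pulled out of the correlation matrix programmatically instead of read off the heatmap. A sketch on synthetic data (the 0.8 threshold is chosen for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

corr = df.corr()
# keep only the upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong = upper.stack()
strong = strong[strong.abs() > 0.8]
print(strong.index.tolist())  # [('a', 'b')]
```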
In [53]:
labeled_barplot(df_model, "SARS_Cov_2_exam_result", perc=True)
plt.show()
In [54]:
df_model['SARS_Cov_2_exam_result'].value_counts()
Out[54]:
0    5086
1     558
Name: SARS_Cov_2_exam_result, dtype: int64

Comment:¶

  • There are 558 positive Covid-19 cases (9.9%) and 5086 negative cases (90.1%), i.e., the classes are highly imbalanced at roughly 9:1.
  • This is the same distribution as before data pre-processing.
  • The integrity of the data has therefore been maintained.

Recommendations on Analytical Approach¶

Analytical approaches that may be fit to be applied to this problem are:

  1. Logistic Regression
  2. Decision Trees
  3. Random Forests

Milestone 3¶

The following analytical models will be utilized:¶

  1. Logistic regression
  2. Decision Trees
  3. Random forests

Metrics of interest¶

  1. Recall (the ability of the model to identify positive Covid-19 cases)
  2. Accuracy
  3. Precision

The predictions made by this classification model will translate as follows:¶

  • True positive (TP): Positive SARS_Cov_2_exam_result cases correctly predicted by the model.
  • True negative (TN): Negative SARS_Cov_2_exam_result cases correctly predicted by the model.
  • False Positive (FP): Cases predicted as positive by the model but are in reality, negative.
  • False negative (FN): Cases predicted to be negative by the model but are in reality, positive.
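These four counts determine the metrics listed above. A small worked sketch with hypothetical labels and predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # actual exam results
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # model predictions

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # 3
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # 3
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # 1
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # 1

recall = tp / (tp + fn)              # 0.75 - share of actual positives caught
precision = tp / (tp + fp)           # 0.75 - share of positive calls that are right
accuracy = (tp + tn) / len(y_true)   # 0.75 - share of all calls that are right
```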

1. Split the data¶

In [55]:
# Separating features and the target column
X = df_model.drop('SARS_Cov_2_exam_result', axis=1)
Y = df_model['SARS_Cov_2_exam_result']
In [56]:
# creating dummy variables

X = pd.get_dummies(
    X,
    columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)

# to ensure all variables are of float type
X = X.astype(float)

X.head()
Out[56]:
Patient_age_quantile Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_hemoglobin_MCH Mean_corpuscular_volume_MCV Respiratory_Syncytial_Virus_detected Respiratory_Syncytial_Virus_not_detected Influenza_A_detected Influenza_A_not_detected Influenza_B_detected Influenza_B_not_detected Parainfluenza_1_detected Parainfluenza_1_not_detected CoronavirusNL63_detected CoronavirusNL63_not_detected Rhinovirus_Enterovirus_detected Rhinovirus_Enterovirus_not_detected Coronavirus_HKU1_detected Coronavirus_HKU1_not_detected Parainfluenza_3_detected Parainfluenza_3_not_detected Chlamydophila_pneumoniae_detected Chlamydophila_pneumoniae_not_detected Adenovirus_detected Adenovirus_not_detected Parainfluenza_4_detected Parainfluenza_4_not_detected Coronavirus229E_detected Coronavirus229E_not_detected CoronavirusOC43_detected CoronavirusOC43_not_detected Inf_A_H1N1_2009_detected Inf_A_H1N1_2009_not_detected Bordetella_pertussis_detected Bordetella_pertussis_not_detected Metapneumovirus_detected Metapneumovirus_not_detected Parainfluenza_2_not_detected Influenza_B__rapid_test_negative Influenza_B__rapid_test_positive Influenza_A__rapid_test_negative Influenza_A__rapid_test_positive
0 13.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
1 17.000 0.000 0.000 0.000 0.237 -0.022 0.011 0.102 0.318 -0.951 -0.095 -0.292 0.166 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 1.000 1.000 0.000 1.000 0.000
2 8.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
3 5.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
4 15.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 1.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 1.000 1.000 0.000 0.000 0.000 0.000
In [57]:
# Splitting the data into train and test sets in a 70:30 ratio
# (this unstratified split is superseded by the stratified split in the next cell)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1, shuffle=True)
In [58]:
# adding constant
X = sm.add_constant(X)

# splitting into a temporary and a test set (70:30)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y
)

# split temp set into train (~45% of all data) and validation (~25%) sets
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.35, random_state=1, stratify=y_temp
)
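The `stratify=Y` argument keeps the roughly 9:1 class ratio intact in every split, which matters here because positives are scarce. The effect can be sketched with a pandas group-wise sample (an equivalent illustration, not the notebook's own code):

```python
import pandas as pd

# imbalanced toy target: 90 negatives, 10 positives
df = pd.DataFrame({"y": [0] * 90 + [1] * 10})

# take 30% of each class for the test set, preserving the 9:1 ratio
test = df.groupby("y", group_keys=False).sample(frac=0.3, random_state=1)
train = df.drop(test.index)

print(train["y"].mean(), test["y"].mean())  # 0.1 in both splits
```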
In [59]:
X_train.head()
Out[59]:
const Patient_age_quantile Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Hematocrit Hemoglobin Mean_platelet_volume Red_blood_Cells Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_hemoglobin_MCH Mean_corpuscular_volume_MCV Respiratory_Syncytial_Virus_detected Respiratory_Syncytial_Virus_not_detected Influenza_A_detected Influenza_A_not_detected Influenza_B_detected Influenza_B_not_detected Parainfluenza_1_detected Parainfluenza_1_not_detected CoronavirusNL63_detected CoronavirusNL63_not_detected Rhinovirus_Enterovirus_detected Rhinovirus_Enterovirus_not_detected Coronavirus_HKU1_detected Coronavirus_HKU1_not_detected Parainfluenza_3_detected Parainfluenza_3_not_detected Chlamydophila_pneumoniae_detected Chlamydophila_pneumoniae_not_detected Adenovirus_detected Adenovirus_not_detected Parainfluenza_4_detected Parainfluenza_4_not_detected Coronavirus229E_detected Coronavirus229E_not_detected CoronavirusOC43_detected CoronavirusOC43_not_detected Inf_A_H1N1_2009_detected Inf_A_H1N1_2009_not_detected Bordetella_pertussis_detected Bordetella_pertussis_not_detected Metapneumovirus_detected Metapneumovirus_not_detected Parainfluenza_2_not_detected Influenza_B__rapid_test_negative Influenza_B__rapid_test_positive Influenza_A__rapid_test_negative Influenza_A__rapid_test_positive
5316 1.000 10.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
5006 1.000 11.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
2433 1.000 1.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 1.000 0.000
5437 1.000 0.000 0.000 0.000 1.000 -2.121 -2.341 -0.775 -0.973 0.387 -1.747 0.821 -2.697 -2.217 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
5032 1.000 9.000 0.000 0.000 0.000 0.053 0.040 -0.102 0.014 -0.014 -0.055 -0.213 0.126 0.066 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

2. Model Building¶

In [60]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf
In [61]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Model Building with original data (Overview)¶

In [62]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Cost:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation Cost:

dtree: 0.08258823529411764

Validation Performance:

dtree: 0.06569343065693431
In [63]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(('Logistic Regression', LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")

for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Training Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train)) * 100
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic Regression: 3.9294117647058826
Bagging: 8.658823529411766
Random forest: 5.52156862745098
GBM: 8.658823529411764
Adaboost: 7.482352941176471
dtree: 8.258823529411764

Training Performance:

Logistic Regression: 5.118110236220472
Bagging: 16.92913385826772
Random forest: 17.716535433070867
GBM: 16.535433070866144
Adaboost: 14.173228346456693
dtree: 16.92913385826772
In [64]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()

Comments:¶

- All models show poor recall performance on the original data.¶

- Each model will be evaluated individually and undergo techniques that can optimize its metrics and performance.¶
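One family of techniques worth trying on such an imbalanced target is class re-weighting, which penalizes missed positives more heavily. A minimal sketch on synthetic data (illustrative only, not the project dataset or its final approach):

```python
# Sketch: class re-weighting on a synthetic imbalanced target
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X_syn = rng.normal(size=(n, 4))
# roughly 13% positives, weakly separable, so a plain model favors the majority class
y_syn = (X_syn[:, 0] + rng.normal(scale=2.0, size=n) > 2.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1, stratify=y_syn
)

plain = LogisticRegression(random_state=1).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", random_state=1).fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall without weighting: {rec_plain:.3f}")
print(f"recall with class_weight='balanced': {rec_weighted:.3f}")
```

`class_weight="balanced"` typically trades some precision for recall; oversampling the minority class (e.g. SMOTE) is another option in the same spirit.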

Logistic Regression Model In Detail¶

Important function definitions¶

In [65]:
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # flagging observations whose predicted probability exceeds the threshold
    pred_temp = model.predict(predictors) > threshold
    # converting the boolean flags to 0/1 class labels
    pred = np.round(pred_temp)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [66]:
# defining a function to plot the confusion_matrix of a classification model


def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [67]:
# fitting the model on training set
logit = sm.Logit(y_train, X_train.astype(float))
lg = logit.fit(
    disp=False
)  # setting disp=False will remove the information on number of iterations

print(lg.summary())
                             Logit Regression Results                             
==================================================================================
Dep. Variable:     SARS_Cov_2_exam_result   No. Observations:                 2567
Model:                              Logit   Df Residuals:                     2533
Method:                               MLE   Df Model:                           33
Date:                    Sat, 11 Feb 2023   Pseudo R-squ.:                 0.09229
Time:                            01:00:10   Log-Likelihood:                -752.07
converged:                          False   LL-Null:                       -828.54
Covariance Type:                nonrobust   LLR p-value:                 2.324e-17
=======================================================================================================================
                                                          coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                  -2.9542      0.340     -8.694      0.000      -3.620      -2.288
Patient_age_quantile                                    0.0368      0.013      2.935      0.003       0.012       0.061
Patient_admitted_to_regular_ward_1=yes__0=no            2.2675      0.484      4.682      0.000       1.318       3.217
Patient_admitted_to_semi_intensive_unit_1=yes__0=no     0.8007      0.936      0.855      0.392      -1.034       2.635
Patient_admitted_to_intensive_care_unit_1=yes__0=no     2.8673      1.053      2.723      0.006       0.804       4.931
Hematocrit                                              4.2839      7.108      0.603      0.547      -9.648      18.216
Hemoglobin                                             -3.5583      7.959     -0.447      0.655     -19.158      12.041
Mean_platelet_volume                                    0.1912      0.209      0.915      0.360      -0.218       0.601
Red_blood_Cells                                        -0.9223      2.348     -0.393      0.695      -5.525       3.680
Lymphocytes                                            -0.6403      0.291     -2.200      0.028      -1.211      -0.070
Mean_corpuscular_hemoglobin_concentration MCHC          0.4418      2.392      0.185      0.853      -4.246       5.130
Leukocytes                                             -1.8201      0.403     -4.511      0.000      -2.611      -1.029
Mean_corpuscular_hemoglobin_MCH                         1.2148      2.968      0.409      0.682      -4.602       7.032
Mean_corpuscular_volume_MCV                            -2.0011      3.148     -0.636      0.525      -8.171       4.169
Respiratory_Syncytial_Virus_detected                  -19.4941   9.37e+06  -2.08e-06      1.000   -1.84e+07    1.84e+07
Respiratory_Syncytial_Virus_not_detected                4.5391   9.36e+06   4.85e-07      1.000   -1.83e+07    1.83e+07
Influenza_A_detected                                  -20.1388   4.61e+06  -4.37e-06      1.000   -9.03e+06    9.03e+06
Influenza_A_not_detected                                5.1838   4.58e+06   1.13e-06      1.000   -8.97e+06    8.97e+06
Influenza_B_detected                                   -8.2054   1.43e+06  -5.75e-06      1.000    -2.8e+06     2.8e+06
Influenza_B_not_detected                               -6.7496   1.49e+06  -4.52e-06      1.000   -2.93e+06    2.93e+06
Parainfluenza_1_detected                              -16.0470        nan        nan        nan         nan         nan
Parainfluenza_1_not_detected                            1.0920        nan        nan        nan         nan         nan
CoronavirusNL63_detected                               -7.7202   1.34e+07  -5.77e-07      1.000   -2.62e+07    2.62e+07
CoronavirusNL63_not_detected                           -7.2348   1.33e+07  -5.42e-07      1.000   -2.62e+07    2.62e+07
Rhinovirus_Enterovirus_detected                        -8.7523        nan        nan        nan         nan         nan
Rhinovirus_Enterovirus_not_detected                    -6.2027        nan        nan        nan         nan         nan
Coronavirus_HKU1_detected                             -19.6165   4.03e+06  -4.87e-06      1.000    -7.9e+06     7.9e+06
Coronavirus_HKU1_not_detected                           4.6615   4.02e+06   1.16e-06      1.000   -7.89e+06    7.89e+06
Parainfluenza_3_detected                              -17.3682    1.5e+07  -1.15e-06      1.000   -2.95e+07    2.95e+07
Parainfluenza_3_not_detected                            2.4132    1.5e+07    1.6e-07      1.000   -2.95e+07    2.95e+07
Chlamydophila_pneumoniae_detected                     -24.9985   5.75e+07  -4.34e-07      1.000   -1.13e+08    1.13e+08
Chlamydophila_pneumoniae_not_detected                  10.0435   1.77e+07   5.68e-07      1.000   -3.46e+07    3.46e+07
Adenovirus_detected                                   -18.1330        nan        nan        nan         nan         nan
Adenovirus_not_detected                                 3.1780        nan        nan        nan         nan         nan
Parainfluenza_4_detected                              -16.2110   4.97e+06  -3.26e-06      1.000   -9.75e+06    9.75e+06
Parainfluenza_4_not_detected                            1.2559   4.97e+06   2.53e-07      1.000   -9.75e+06    9.75e+06
Coronavirus229E_detected                              -14.1399        nan        nan        nan         nan         nan
Coronavirus229E_not_detected                           -0.8151        nan        nan        nan         nan         nan
CoronavirusOC43_detected                              -18.2074   6.48e+06  -2.81e-06      1.000   -1.27e+07    1.27e+07
CoronavirusOC43_not_detected                            3.2524   6.47e+06   5.02e-07      1.000   -1.27e+07    1.27e+07
Inf_A_H1N1_2009_detected                              -17.4714    5.3e+06   -3.3e-06      1.000   -1.04e+07    1.04e+07
Inf_A_H1N1_2009_not_detected                            2.5164    5.3e+06   4.75e-07      1.000   -1.04e+07    1.04e+07
Bordetella_pertussis_detected                         -15.0837        nan        nan        nan         nan         nan
Bordetella_pertussis_not_detected                       0.1287        nan        nan        nan         nan         nan
Metapneumovirus_detected                              -12.6127   1.89e+07  -6.67e-07      1.000   -3.71e+07    3.71e+07
Metapneumovirus_not_detected                           -2.3423   1.89e+07  -1.24e-07      1.000   -3.71e+07    3.71e+07
Parainfluenza_2_not_detected                          -14.9550   2.33e+07  -6.41e-07      1.000   -4.57e+07    4.57e+07
Influenza_B__rapid_test_negative                       -1.3120        nan        nan        nan         nan         nan
Influenza_B__rapid_test_positive                      -21.2481        nan        nan        nan         nan         nan
Influenza_A__rapid_test_negative                        1.1056        nan        nan        nan         nan         nan
Influenza_A__rapid_test_positive                      -23.6657        nan        nan        nan         nan         nan
=======================================================================================================================
/Users/kofori/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "

Model performance evaluation¶

In [68]:
# predicting on training set
# default threshold is 0.5, if predicted probability is greater than 0.5 the observation will be classified as 1

pred_train = lg.predict(X_train) > 0.5
pred_train = np.round(pred_train)
In [69]:
cm = confusion_matrix(y_train, pred_train)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.show()

The confusion matrix for the training set:

  • True Positives (TP): 21 positive Covid-19 cases were correctly identified.
  • True Negatives (TN): 2306 negative cases were correctly identified.
  • False Positives (FP): 7 negative cases were incorrectly flagged as positive.
  • False Negatives (FN): 233 positive cases were missed.
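Reading the headline metrics straight off these counts makes the imbalance problem concrete: accuracy looks strong only because true negatives dominate.

```python
# Metrics computed directly from the confusion-matrix counts above
TP, TN, FP, FN = 21, 2306, 7, 233

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)        # sensitivity: share of positives caught
precision = TP / (TP + FP)     # share of positive predictions that are right

print(f"accuracy:  {accuracy:.3f}")   # ~0.907, inflated by the many TNs
print(f"recall:    {recall:.3f}")     # ~0.083, most positives are missed
print(f"precision: {precision:.3f}")  # 0.750
```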

Checking for Multicollinearity¶

In [70]:
# let's check the VIF of the predictors
vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

const                                                   16.383
Patient_age_quantile                                     1.106
Patient_admitted_to_regular_ward_1=yes__0=no             1.175
Patient_admitted_to_semi_intensive_unit_1=yes__0=no      1.168
Patient_admitted_to_intensive_care_unit_1=yes__0=no      1.148
Hematocrit                                            1247.367
Hemoglobin                                            1362.359
Mean_platelet_volume                                     1.079
Red_blood_Cells                                         96.796
Lymphocytes                                              1.211
Mean_corpuscular_hemoglobin_concentration MCHC          85.543
Leukocytes                                               1.326
Mean_corpuscular_hemoglobin_MCH                        149.056
Mean_corpuscular_volume_MCV                            157.850
Respiratory_Syncytial_Virus_detected                       inf
Respiratory_Syncytial_Virus_not_detected                   inf
Influenza_A_detected                                       inf
Influenza_A_not_detected                                   inf
Influenza_B_detected                                       inf
Influenza_B_not_detected                                   inf
Parainfluenza_1_detected                                   inf
Parainfluenza_1_not_detected                               inf
CoronavirusNL63_detected                                   inf
CoronavirusNL63_not_detected                               inf
Rhinovirus_Enterovirus_detected                            inf
Rhinovirus_Enterovirus_not_detected                        inf
Coronavirus_HKU1_detected                                  inf
Coronavirus_HKU1_not_detected                              inf
Parainfluenza_3_detected                                   inf
Parainfluenza_3_not_detected                               inf
Chlamydophila_pneumoniae_detected                          inf
Chlamydophila_pneumoniae_not_detected                      inf
Adenovirus_detected                                        inf
Adenovirus_not_detected                                    inf
Parainfluenza_4_detected                                   inf
Parainfluenza_4_not_detected                               inf
Coronavirus229E_detected                                   inf
Coronavirus229E_not_detected                               inf
CoronavirusOC43_detected                                   inf
CoronavirusOC43_not_detected                               inf
Inf_A_H1N1_2009_detected                                   inf
Inf_A_H1N1_2009_not_detected                               inf
Bordetella_pertussis_detected                              inf
Bordetella_pertussis_not_detected                          inf
Metapneumovirus_detected                                   inf
Metapneumovirus_not_detected                               inf
Parainfluenza_2_not_detected                               inf
Influenza_B__rapid_test_negative                           inf
Influenza_B__rapid_test_positive                           inf
Influenza_A__rapid_test_negative                           inf
Influenza_A__rapid_test_positive                           inf
dtype: float64

Observations:¶

Numeric features with VIF greater than 5 are:

  • Hematocrit
  • Hemoglobin
  • Red_blood_Cells
  • Mean_corpuscular_hemoglobin_concentration MCHC
  • Mean_corpuscular_hemoglobin_MCH
  • Mean_corpuscular_volume_MCV

In addition, every "_detected"/"_not_detected" indicator column has an infinite VIF, which signals an exact linear dependence among those dummy variables.
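A minimal synthetic illustration (hypothetical column names) of why complementary indicator columns produce infinite VIFs: each `_not_detected` column is an exact linear function of its `_detected` partner and the constant, so the auxiliary regression's R² is 1.

```python
# Sketch: a perfectly collinear dummy pair yields effectively infinite VIFs
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
d = rng.integers(0, 2, size=50)
X_demo = pd.DataFrame(
    {
        "const": 1.0,
        "virus_detected": d,
        "virus_not_detected": 1 - d,  # exactly determined by the other two columns
        "noise": rng.normal(size=50),
    }
)

vifs = pd.Series(
    [variance_inflation_factor(X_demo.values, i) for i in range(X_demo.shape[1])],
    index=X_demo.columns,
)
print(vifs)  # the complementary dummy columns blow up; 'noise' stays near 1
```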

The next step is to drop the above features one at a time and then recheck multicollinearity¶

In [71]:
X_train2 = X_train.drop(["Hematocrit"], axis=1)
logit = sm.Logit(y_train, X_train2.astype(float))
lg = logit.fit(
    disp=False
)  # setting disp=False will remove the information on number of iterations

print(lg.summary())
                             Logit Regression Results                             
==================================================================================
Dep. Variable:     SARS_Cov_2_exam_result   No. Observations:                 2567
Model:                              Logit   Df Residuals:                     2534
Method:                               MLE   Df Model:                           32
Date:                    Sat, 11 Feb 2023   Pseudo R-squ.:                 -0.3301
Time:                            01:00:10   Log-Likelihood:                -1102.1
converged:                          False   LL-Null:                       -828.54
Covariance Type:                nonrobust   LLR p-value:                     1.000
=======================================================================================================================
                                                          coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                  -2.9533      0.344     -8.583      0.000      -3.628      -2.279
Patient_age_quantile                                    0.0365      0.013      2.915      0.004       0.012       0.061
Patient_admitted_to_regular_ward_1=yes__0=no            2.2434      0.485      4.630      0.000       1.294       3.193
Patient_admitted_to_semi_intensive_unit_1=yes__0=no     0.7503      0.942      0.796      0.426      -1.096       2.597
Patient_admitted_to_intensive_care_unit_1=yes__0=no     2.7938      1.035      2.699      0.007       0.765       4.823
Hemoglobin                                              0.9961      2.366      0.421      0.674      -3.641       5.633
Mean_platelet_volume                                    0.1944      0.209      0.930      0.352      -0.215       0.604
Red_blood_Cells                                        -0.9173      2.470     -0.371      0.710      -5.759       3.924
Lymphocytes                                            -0.6542      0.291     -2.250      0.024      -1.224      -0.084
Mean_corpuscular_hemoglobin_concentration MCHC         -0.6593      1.602     -0.412      0.681      -3.799       2.481
Leukocytes                                             -1.7881      0.400     -4.469      0.000      -2.572      -1.004
Mean_corpuscular_hemoglobin_MCH                         1.0910      2.990      0.365      0.715      -4.769       6.951
Mean_corpuscular_volume_MCV                            -1.8656      3.237     -0.576      0.564      -8.209       4.478
Respiratory_Syncytial_Virus_detected                  -18.1605   4.88e+06  -3.72e-06      1.000   -9.57e+06    9.57e+06
Respiratory_Syncytial_Virus_not_detected               -3.9729   4.88e+06  -8.15e-07      1.000   -9.56e+06    9.56e+06
Influenza_A_detected                                  -21.8960        nan        nan        nan         nan         nan
Influenza_A_not_detected                               -0.2374        nan        nan        nan         nan         nan
Influenza_B_detected                                  -11.7725        nan        nan        nan         nan         nan
Influenza_B_not_detected                              -10.3609        nan        nan        nan         nan         nan
Parainfluenza_1_detected                             -140.6339   5.03e+56  -2.79e-55      1.000   -9.86e+56    9.86e+56
Parainfluenza_1_not_detected                          118.5004   5.05e+06   2.35e-05      1.000   -9.89e+06    9.89e+06
CoronavirusNL63_detected                              -11.2947        nan        nan        nan         nan         nan
CoronavirusNL63_not_detected                          -10.8387        nan        nan        nan         nan         nan
Rhinovirus_Enterovirus_detected                       -12.3511   2.28e+06  -5.42e-06      1.000   -4.46e+06    4.46e+06
Rhinovirus_Enterovirus_not_detected                    -9.7823   2.28e+06  -4.29e-06      1.000   -4.46e+06    4.46e+06
Coronavirus_HKU1_detected                             -14.1479   4.51e+06  -3.14e-06      1.000   -8.83e+06    8.83e+06
Coronavirus_HKU1_not_detected                          -7.9855   4.51e+06  -1.77e-06      1.000   -8.83e+06    8.83e+06
Parainfluenza_3_detected                              -21.6833   1.03e+06   -2.1e-05      1.000   -2.02e+06    2.02e+06
Parainfluenza_3_not_detected                           -0.4501   1.03e+06  -4.37e-07      1.000   -2.02e+06    2.02e+06
Chlamydophila_pneumoniae_detected                     -14.6920        nan        nan        nan         nan         nan
Chlamydophila_pneumoniae_not_detected                  -7.4414        nan        nan        nan         nan         nan
Adenovirus_detected                                   -17.7638        nan        nan        nan         nan         nan
Adenovirus_not_detected                                -4.3696        nan        nan        nan         nan         nan
Parainfluenza_4_detected                                9.9741        nan        nan        nan         nan         nan
Parainfluenza_4_not_detected                          -32.1075        nan        nan        nan         nan         nan
Coronavirus229E_detected                              -15.6418   4.25e+06  -3.68e-06      1.000   -8.33e+06    8.33e+06
Coronavirus229E_not_detected                           -6.4916   4.25e+06  -1.53e-06      1.000   -8.33e+06    8.33e+06
CoronavirusOC43_detected                              -22.0646   7.03e+06  -3.14e-06      1.000   -1.38e+07    1.38e+07
CoronavirusOC43_not_detected                           -0.0688   7.03e+06  -9.79e-09      1.000   -1.38e+07    1.38e+07
Inf_A_H1N1_2009_detected                              -29.9233   6.34e+07  -4.72e-07      1.000   -1.24e+08    1.24e+08
Inf_A_H1N1_2009_not_detected                            7.7899   2.25e+07   3.47e-07      1.000    -4.4e+07     4.4e+07
Bordetella_pertussis_detected                         -15.0491   3.79e+06  -3.97e-06      1.000   -7.43e+06    7.43e+06
Bordetella_pertussis_not_detected                      -7.0843   3.79e+06  -1.87e-06      1.000   -7.43e+06    7.43e+06
Metapneumovirus_detected                              -19.1391   3.83e+06     -5e-06      1.000    -7.5e+06     7.5e+06
Metapneumovirus_not_detected                           -2.9943   3.83e+06  -7.83e-07      1.000    -7.5e+06     7.5e+06
Parainfluenza_2_not_detected                          -22.1334   1.72e+07  -1.28e-06      1.000   -3.38e+07    3.38e+07
Influenza_B__rapid_test_negative                       11.4975   8.48e+06   1.36e-06      1.000   -1.66e+07    1.66e+07
Influenza_B__rapid_test_positive                      -44.6537   1.26e+12  -3.56e-11      1.000   -2.46e+12    2.46e+12
Influenza_A__rapid_test_negative                      -11.6972   8.48e+06  -1.38e-06      1.000   -1.66e+07    1.66e+07
Influenza_A__rapid_test_positive                      -21.4589   8.48e+06  -2.53e-06      1.000   -1.66e+07    1.66e+07
=======================================================================================================================
/Users/kofori/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
In [72]:
# let's check the VIF of the predictors
vif_series = pd.Series(
    [variance_inflation_factor(X_train2.values, i) for i in range(X_train2.shape[1])],
    index=X_train2.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

const                                                  16.331
Patient_age_quantile                                    1.104
Patient_admitted_to_regular_ward_1=yes__0=no            1.168
Patient_admitted_to_semi_intensive_unit_1=yes__0=no     1.164
Patient_admitted_to_intensive_care_unit_1=yes__0=no     1.146
Hemoglobin                                             79.025
Mean_platelet_volume                                    1.071
Red_blood_Cells                                        90.763
Lymphocytes                                             1.196
Mean_corpuscular_hemoglobin_concentration MCHC         30.896
Leukocytes                                              1.324
Mean_corpuscular_hemoglobin_MCH                       143.537
Mean_corpuscular_volume_MCV                           145.782
Respiratory_Syncytial_Virus_detected                      inf
Respiratory_Syncytial_Virus_not_detected                  inf
Influenza_A_detected                                      inf
Influenza_A_not_detected                                  inf
Influenza_B_detected                                      inf
Influenza_B_not_detected                                  inf
Parainfluenza_1_detected                                  inf
Parainfluenza_1_not_detected                              inf
CoronavirusNL63_detected                                  inf
CoronavirusNL63_not_detected                              inf
Rhinovirus_Enterovirus_detected                           inf
Rhinovirus_Enterovirus_not_detected                       inf
Coronavirus_HKU1_detected                                 inf
Coronavirus_HKU1_not_detected                             inf
Parainfluenza_3_detected                                  inf
Parainfluenza_3_not_detected                              inf
Chlamydophila_pneumoniae_detected                         inf
Chlamydophila_pneumoniae_not_detected                     inf
Adenovirus_detected                                       inf
Adenovirus_not_detected                                   inf
Parainfluenza_4_detected                                  inf
Parainfluenza_4_not_detected                              inf
Coronavirus229E_detected                                  inf
Coronavirus229E_not_detected                              inf
CoronavirusOC43_detected                                  inf
CoronavirusOC43_not_detected                              inf
Inf_A_H1N1_2009_detected                                  inf
Inf_A_H1N1_2009_not_detected                              inf
Bordetella_pertussis_detected                             inf
Bordetella_pertussis_not_detected                         inf
Metapneumovirus_detected                                  inf
Metapneumovirus_not_detected                              inf
Parainfluenza_2_not_detected                              inf
Influenza_B__rapid_test_negative                          inf
Influenza_B__rapid_test_positive                          inf
Influenza_A__rapid_test_negative                          inf
Influenza_A__rapid_test_positive                          inf
dtype: float64

Comment¶

  • The VIF values changed little after dropping 'Hematocrit'.
  • Variables with VIF > 5 will therefore be dropped one at a time until the multicollinearity is resolved.
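The drop-and-recheck cycle described above can be automated. A hedged sketch (the `drop_high_vif` helper and the synthetic data are illustrative, not part of the notebook): repeatedly remove the worst offender until every remaining VIF is at or below the cutoff.

```python
# Sketch: iterative VIF elimination on synthetic data (cutoff = 5, as above)
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, cutoff=5.0, keep=("const",)):
    """Repeatedly drop the non-protected column with the largest VIF above cutoff."""
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        ).drop(labels=list(keep), errors="ignore")
        worst = vifs.idxmax()
        if vifs[worst] <= cutoff:
            return X
        X = X.drop(columns=[worst])

rng = np.random.default_rng(1)
a = rng.normal(size=200)
X_syn = pd.DataFrame(
    {
        "const": 1.0,
        "a": a,
        "a_copy": a + rng.normal(scale=0.01, size=200),  # near-duplicate of a
        "b": rng.normal(size=200),
    }
)
X_reduced = drop_high_vif(X_syn)
print(list(X_reduced.columns))  # exactly one of the collinear pair survives
```

Dropping one variable at a time matters: removing a single member of a collinear group can bring its partners back under the cutoff, as seen with 'Hematocrit' above.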
In [73]:
### Dropping three more variables with VIF > 5
X_train3 = X_train2.drop(["Red_blood_Cells", "Hemoglobin","Mean_corpuscular_hemoglobin_MCH"], axis=1)
logit = sm.Logit(y_train, X_train3.astype(float))
lg = logit.fit(
    disp=False
)  # setting disp=False will remove the information on number of iterations

print(lg.summary())
                             Logit Regression Results                             
==================================================================================
Dep. Variable:     SARS_Cov_2_exam_result   No. Observations:                 2567
Model:                              Logit   Df Residuals:                     2537
Method:                               MLE   Df Model:                           29
Date:                    Sat, 11 Feb 2023   Pseudo R-squ.:                 0.09159
Time:                            01:00:11   Log-Likelihood:                -752.65
converged:                          False   LL-Null:                       -828.54
Covariance Type:                nonrobust   LLR p-value:                 1.397e-18
=======================================================================================================================
                                                          coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                  -2.8293      0.173    -16.309      0.000      -3.169      -2.489
Patient_age_quantile                                    0.0363      0.012      2.920      0.004       0.012       0.061
Patient_admitted_to_regular_ward_1=yes__0=no            2.1344      0.444      4.811      0.000       1.265       3.004
Patient_admitted_to_semi_intensive_unit_1=yes__0=no     0.7016      0.904      0.776      0.438      -1.070       2.473
Patient_admitted_to_intensive_care_unit_1=yes__0=no     2.6826      1.008      2.660      0.008       0.706       4.659
Mean_platelet_volume                                    0.1808      0.199      0.909      0.363      -0.209       0.571
Lymphocytes                                            -0.6457      0.280     -2.308      0.021      -1.194      -0.097
Mean_corpuscular_hemoglobin_concentration MCHC          0.1107      0.233      0.475      0.635      -0.346       0.568
Leukocytes                                             -1.7292      0.375     -4.615      0.000      -2.463      -0.995
Mean_corpuscular_volume_MCV                            -0.4449      0.243     -1.833      0.067      -0.921       0.031
Respiratory_Syncytial_Virus_detected                  -17.0293        nan        nan        nan         nan         nan
Respiratory_Syncytial_Virus_not_detected                4.4555        nan        nan        nan         nan         nan
Influenza_A_detected                                  -14.4232   1.15e+07  -1.25e-06      1.000   -2.25e+07    2.25e+07
Influenza_A_not_detected                                1.8493   1.15e+07   1.61e-07      1.000   -2.25e+07    2.25e+07
Influenza_B_detected                                   -7.0000        nan        nan        nan         nan         nan
Influenza_B_not_detected                               -5.5738        nan        nan        nan         nan         nan
Parainfluenza_1_detected                              -13.8470   1.06e+07   -1.3e-06      1.000   -2.08e+07    2.08e+07
Parainfluenza_1_not_detected                            1.2732   1.07e+07   1.19e-07      1.000   -2.09e+07    2.09e+07
CoronavirusNL63_detected                               -6.5220        nan        nan        nan         nan         nan
CoronavirusNL63_not_detected                           -6.0518        nan        nan        nan         nan         nan
Rhinovirus_Enterovirus_detected                        -7.5791        nan        nan        nan         nan         nan
Rhinovirus_Enterovirus_not_detected                    -4.9947        nan        nan        nan         nan         nan
Coronavirus_HKU1_detected                             -12.6400        nan        nan        nan         nan         nan
Coronavirus_HKU1_not_detected                           0.0662        nan        nan        nan         nan         nan
Parainfluenza_3_detected                              -13.7939   1.96e+07  -7.04e-07      1.000   -3.84e+07    3.84e+07
Parainfluenza_3_not_detected                            1.2200   1.96e+07   6.23e-08      1.000   -3.84e+07    3.84e+07
Chlamydophila_pneumoniae_detected                     -15.9385   2.34e+07  -6.81e-07      1.000   -4.59e+07    4.59e+07
Chlamydophila_pneumoniae_not_detected                   3.3647   2.34e+07   1.44e-07      1.000   -4.59e+07    4.59e+07
Adenovirus_detected                                   -12.1188        nan        nan        nan         nan         nan
Adenovirus_not_detected                                -0.4550        nan        nan        nan         nan         nan
Parainfluenza_4_detected                              -12.9639   1.53e+07  -8.47e-07      1.000      -3e+07       3e+07
Parainfluenza_4_not_detected                            0.3901   1.53e+07   2.55e-08      1.000      -3e+07       3e+07
Coronavirus229E_detected                              -16.1049   7.92e+06  -2.03e-06      1.000   -1.55e+07    1.55e+07
Coronavirus229E_not_detected                            3.5311   7.92e+06   4.46e-07      1.000   -1.55e+07    1.55e+07
CoronavirusOC43_detected                              -15.4096        nan        nan        nan         nan         nan
CoronavirusOC43_not_detected                            2.8358        nan        nan        nan         nan         nan
Inf_A_H1N1_2009_detected                              -17.9507   6.19e+06   -2.9e-06      1.000   -1.21e+07    1.21e+07
Inf_A_H1N1_2009_not_detected                            5.3769   6.19e+06   8.68e-07      1.000   -1.21e+07    1.21e+07
Bordetella_pertussis_detected                         -14.4267        nan        nan        nan         nan         nan
Bordetella_pertussis_not_detected                       1.8529        nan        nan        nan         nan         nan
Metapneumovirus_detected                              -15.9689        nan        nan        nan         nan         nan
Metapneumovirus_not_detected                            3.3951        nan        nan        nan         nan         nan
Parainfluenza_2_not_detected                          -12.5738        nan        nan        nan         nan         nan
Influenza_B__rapid_test_negative                       -5.5340   6.72e+07  -8.23e-08      1.000   -1.32e+08    1.32e+08
Influenza_B__rapid_test_positive                       -7.6429   6.72e+07  -1.14e-07      1.000   -1.32e+08    1.32e+08
Influenza_A__rapid_test_negative                        5.3317   6.72e+07   7.93e-08      1.000   -1.32e+08    1.32e+08
Influenza_A__rapid_test_positive                      -18.5085   6.72e+07  -2.75e-07      1.000   -1.32e+08    1.32e+08
=======================================================================================================================
/Users/kofori/opt/anaconda3/lib/python3.9/site-packages/statsmodels/base/model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
In [74]:
# let's check the VIF of the predictors
vif_series = pd.Series(
    [variance_inflation_factor(X_train3.values, i) for i in range(X_train3.shape[1])],
    index=X_train3.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series))
VIF values: 

const                                                 5.222
Patient_age_quantile                                  1.087
Patient_admitted_to_regular_ward_1=yes__0=no          1.059
Patient_admitted_to_semi_intensive_unit_1=yes__0=no   1.116
Patient_admitted_to_intensive_care_unit_1=yes__0=no   1.100
Mean_platelet_volume                                  1.057
Lymphocytes                                           1.165
Mean_corpuscular_hemoglobin_concentration MCHC        1.026
Leukocytes                                            1.272
Mean_corpuscular_volume_MCV                           1.029
Respiratory_Syncytial_Virus_detected                    inf
Respiratory_Syncytial_Virus_not_detected                inf
Influenza_A_detected                                    inf
Influenza_A_not_detected                                inf
Influenza_B_detected                                    inf
Influenza_B_not_detected                                inf
Parainfluenza_1_detected                                inf
Parainfluenza_1_not_detected                            inf
CoronavirusNL63_detected                                inf
CoronavirusNL63_not_detected                            inf
Rhinovirus_Enterovirus_detected                         inf
Rhinovirus_Enterovirus_not_detected                     inf
Coronavirus_HKU1_detected                               inf
Coronavirus_HKU1_not_detected                           inf
Parainfluenza_3_detected                                inf
Parainfluenza_3_not_detected                            inf
Chlamydophila_pneumoniae_detected                       inf
Chlamydophila_pneumoniae_not_detected                   inf
Adenovirus_detected                                     inf
Adenovirus_not_detected                                 inf
Parainfluenza_4_detected                                inf
Parainfluenza_4_not_detected                            inf
Coronavirus229E_detected                                inf
Coronavirus229E_not_detected                            inf
CoronavirusOC43_detected                                inf
CoronavirusOC43_not_detected                            inf
Inf_A_H1N1_2009_detected                                inf
Inf_A_H1N1_2009_not_detected                            inf
Bordetella_pertussis_detected                           inf
Bordetella_pertussis_not_detected                       inf
Metapneumovirus_detected                                inf
Metapneumovirus_not_detected                            inf
Parainfluenza_2_not_detected                            inf
Influenza_B__rapid_test_negative                        inf
Influenza_B__rapid_test_positive                        inf
Influenza_A__rapid_test_negative                        inf
Influenza_A__rapid_test_positive                        inf
dtype: float64

Observation:¶

  • None of the continuous predictors exhibits high multicollinearity (VIF > 5).
  • The paired detected/not_detected dummies still show infinite VIFs because each pair is perfectly collinear with the intercept; their coefficients should be read with caution, but the estimates for the remaining variables are reliable.
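The infinite VIFs stem from keeping both the `_detected` and `_not_detected` dummy for each pathogen test: the pair always sums to one, so each dummy is a perfect linear function of the other and the intercept. A minimal sketch of the fix, assuming the dummies were produced with `pd.get_dummies` (the `Influenza_A` column here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Influenza_A": ["detected", "not_detected", "not_detected"]})

# Keeping both levels yields two columns that always sum to 1 -> perfect
# collinearity with the intercept, hence VIF = inf
both = pd.get_dummies(df, columns=["Influenza_A"])

# drop_first=True keeps a single dummy per test and removes the redundancy
one = pd.get_dummies(df, columns=["Influenza_A"], drop_first=True)
print(list(one.columns))  # ['Influenza_A_not_detected']
```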

Using coefficients of the model to find odds¶

In [75]:
# converting coefficients to odds
odds = np.exp(lg.params)

# finding the percentage change
perc_change_odds = (np.exp(lg.params) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train3.columns).T
Out[75]:
const Patient_age_quantile Patient_admitted_to_regular_ward_1=yes__0=no Patient_admitted_to_semi_intensive_unit_1=yes__0=no Patient_admitted_to_intensive_care_unit_1=yes__0=no Mean_platelet_volume Lymphocytes Mean_corpuscular_hemoglobin_concentration MCHC Leukocytes Mean_corpuscular_volume_MCV Respiratory_Syncytial_Virus_detected Respiratory_Syncytial_Virus_not_detected Influenza_A_detected Influenza_A_not_detected Influenza_B_detected Influenza_B_not_detected Parainfluenza_1_detected Parainfluenza_1_not_detected CoronavirusNL63_detected CoronavirusNL63_not_detected Rhinovirus_Enterovirus_detected Rhinovirus_Enterovirus_not_detected Coronavirus_HKU1_detected Coronavirus_HKU1_not_detected Parainfluenza_3_detected Parainfluenza_3_not_detected Chlamydophila_pneumoniae_detected Chlamydophila_pneumoniae_not_detected Adenovirus_detected Adenovirus_not_detected Parainfluenza_4_detected Parainfluenza_4_not_detected Coronavirus229E_detected Coronavirus229E_not_detected CoronavirusOC43_detected CoronavirusOC43_not_detected Inf_A_H1N1_2009_detected Inf_A_H1N1_2009_not_detected Bordetella_pertussis_detected Bordetella_pertussis_not_detected Metapneumovirus_detected Metapneumovirus_not_detected Parainfluenza_2_not_detected Influenza_B__rapid_test_negative Influenza_B__rapid_test_positive Influenza_A__rapid_test_negative Influenza_A__rapid_test_positive
Odds 0.059 1.037 8.452 2.017 14.623 1.198 0.524 1.117 0.177 0.641 0.000 86.097 0.000 6.356 0.001 0.004 0.000 3.572 0.001 0.002 0.001 0.007 0.000 1.068 0.000 3.387 0.000 28.924 0.000 0.634 0.000 1.477 0.000 34.163 0.000 17.045 0.000 216.355 0.000 6.378 0.000 29.817 0.000 0.004 0.000 206.793 0.000
Change_odd% -94.094 3.702 745.212 101.693 1362.333 19.820 -47.568 11.706 -82.257 -35.911 -100.000 8509.704 -100.000 535.565 -99.909 -99.620 -100.000 257.242 -99.853 -99.765 -99.949 -99.323 -100.000 6.839 -100.000 238.734 -100.000 2792.426 -99.999 -36.554 -100.000 47.710 -100.000 3316.288 -100.000 1604.452 -100.000 21535.479 -100.000 537.845 -100.000 2881.663 -100.000 -99.605 -99.952 20579.273 -100.000

Key Coefficient interpretations¶

  • Admission to the regular ward with flu-like symptoms increases the odds of a positive Covid-19 result about 8.45 times.
  • Admission to the intensive care unit with flu-like symptoms makes it about 14.62 times more likely that a patient is Covid-19 positive.
  • A unit increase in Lymphocytes reduces the odds of a positive result by about 47.57%.
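These figures follow directly from exponentiating the coefficients in the Logit summary; a quick arithmetic check:

```python
import numpy as np

# Coefficients taken from the Logit summary above
coef_regular_ward = 2.1344
coef_lymphocytes = -0.6457

odds_ratio_ward = np.exp(coef_regular_ward)              # ~8.45x higher odds
pct_change_lymph = (np.exp(coef_lymphocytes) - 1) * 100  # ~-47.57% per unit increase
print(round(odds_ratio_ward, 2), round(pct_change_lymph, 2))
```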

Logistic regression model performance on the training set¶

In [76]:
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_train3, y_train)

Observations¶

  • True Positives (TP): There are 17 true positive Covid-19 cases.
  • True Negatives (TN): There are 2306 true negative Covid-19 cases.
  • False Positives (FP): The model predicted 6 false positive cases.
  • False Negatives (FN): The model predicted 237 false negatives.
In [77]:
log_reg_model_train_perf = model_performance_classification_statsmodels(
    lg, X_train3, y_train
)

print("Training performance:")
log_reg_model_train_perf
Training performance:
Out[77]:
Accuracy Recall Precision F1
0 0.905 0.067 0.739 0.123

Comment¶

Although the accuracy and precision of the model are fairly good, recall is poor (6.7%).

ROC-AUC¶

In [78]:
logit_roc_auc_train = roc_auc_score(y_train, lg.predict(X_train3))
fpr, tpr, thresholds = roc_curve(y_train, lg.predict(X_train3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

Observation:¶

  • The Logistic Regression model has poor recall but a fair ROC-AUC score.
  • In other words, overall accuracy is good, but the model misses most positive cases.

Using Precision-Recall curve to find a threshold¶

In [79]:
y_scores = lg.predict(X_train3)
prec, rec, tre = precision_recall_curve(y_train, y_scores)


def plot_prec_recall_vs_thresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_thresh(prec, rec, tre)
plt.show()

Observation:¶

At a threshold of about 0.16, precision ≈ recall.
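The crossover can also be read off programmatically rather than by eye. A sketch on synthetic scores (in the notebook, `y_train` and `y_scores = lg.predict(X_train3)` would be used instead):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic ground truth and probability-like scores standing in for y_train / y_scores
rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.1, size=2000)
scores = np.clip(0.1 * y_true + rng.normal(0.1, 0.08, size=2000), 0, 1)

prec, rec, thr = precision_recall_curve(y_true, scores)
# precision and recall have one more entry than thresholds; align by dropping the last
crossover = thr[np.argmin(np.abs(prec[:-1] - rec[:-1]))]
print(round(float(crossover), 3))
```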

In [80]:
# setting the threshold
optimal_threshold_curve = 0.16
In [81]:
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_train3, y_train, threshold=optimal_threshold_curve)
In [82]:
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg, X_train3, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
Out[82]:
Accuracy Recall Precision F1
0 0.898 0.142 0.444 0.215

Observation¶

By using the optimal threshold:

  • The recall improved to 14%, albeit still low.
  • The number of true positives increased to 36.
  • The number of false negatives decreased to 218.

Performance on the validation set¶

In [83]:
X_val_3 = X_val[X_train3.columns].astype(float)
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_val_3, y_val, threshold=optimal_threshold_curve)
In [84]:
log_reg_model_test_perf = model_performance_classification_statsmodels(
    lg, X_val_3, y_val, threshold=optimal_threshold_curve
)

print("Validation set performance:")
log_reg_model_test_perf
Validation set performance:
Out[84]:
Accuracy Recall Precision F1
0 0.892 0.131 0.375 0.195

Observation:¶

  • Performance on the validation set is about the same as that of the training set.
  • The model is generalizing well.
  • Although accuracy is good (89%), recall is still low (13%)
  • This implies that the model is unable to effectively pick up positive cases
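One common remedy for this, not applied in this notebook, is to re-weight the minority class during training. A sketch on synthetic imbalanced data using scikit-learn's `class_weight` option:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Imbalanced toy data (~10% positives) standing in for the Covid dataset
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# "balanced" up-weights positives, trading some precision for higher recall
print(recall_score(y, plain.predict(X)), recall_score(y, weighted.predict(X)))
```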

Logistic regression performance on the test set¶

In [85]:
X_test3 = X_test[X_train3.columns].astype(float)
# creating confusion matrix
confusion_matrix_statsmodels(lg, X_test3, y_test)

log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg, X_test3, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
Out[85]:
Accuracy Recall Precision F1
0 0.901 0.162 0.500 0.244

Final Model Summary for Logistic Regression¶

In [86]:
print(lg.summary())
                             Logit Regression Results                             
==================================================================================
Dep. Variable:     SARS_Cov_2_exam_result   No. Observations:                 2567
Model:                              Logit   Df Residuals:                     2537
Method:                               MLE   Df Model:                           29
Date:                    Sat, 11 Feb 2023   Pseudo R-squ.:                 0.09159
Time:                            01:00:13   Log-Likelihood:                -752.65
converged:                          False   LL-Null:                       -828.54
Covariance Type:                nonrobust   LLR p-value:                 1.397e-18
=======================================================================================================================
                                                          coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                  -2.8293      0.173    -16.309      0.000      -3.169      -2.489
Patient_age_quantile                                    0.0363      0.012      2.920      0.004       0.012       0.061
Patient_admitted_to_regular_ward_1=yes__0=no            2.1344      0.444      4.811      0.000       1.265       3.004
Patient_admitted_to_semi_intensive_unit_1=yes__0=no     0.7016      0.904      0.776      0.438      -1.070       2.473
Patient_admitted_to_intensive_care_unit_1=yes__0=no     2.6826      1.008      2.660      0.008       0.706       4.659
Mean_platelet_volume                                    0.1808      0.199      0.909      0.363      -0.209       0.571
Lymphocytes                                            -0.6457      0.280     -2.308      0.021      -1.194      -0.097
Mean_corpuscular_hemoglobin_concentration MCHC          0.1107      0.233      0.475      0.635      -0.346       0.568
Leukocytes                                             -1.7292      0.375     -4.615      0.000      -2.463      -0.995
Mean_corpuscular_volume_MCV                            -0.4449      0.243     -1.833      0.067      -0.921       0.031
Respiratory_Syncytial_Virus_detected                  -17.0293        nan        nan        nan         nan         nan
Respiratory_Syncytial_Virus_not_detected                4.4555        nan        nan        nan         nan         nan
Influenza_A_detected                                  -14.4232   1.15e+07  -1.25e-06      1.000   -2.25e+07    2.25e+07
Influenza_A_not_detected                                1.8493   1.15e+07   1.61e-07      1.000   -2.25e+07    2.25e+07
Influenza_B_detected                                   -7.0000        nan        nan        nan         nan         nan
Influenza_B_not_detected                               -5.5738        nan        nan        nan         nan         nan
Parainfluenza_1_detected                              -13.8470   1.06e+07   -1.3e-06      1.000   -2.08e+07    2.08e+07
Parainfluenza_1_not_detected                            1.2732   1.07e+07   1.19e-07      1.000   -2.09e+07    2.09e+07
CoronavirusNL63_detected                               -6.5220        nan        nan        nan         nan         nan
CoronavirusNL63_not_detected                           -6.0518        nan        nan        nan         nan         nan
Rhinovirus_Enterovirus_detected                        -7.5791        nan        nan        nan         nan         nan
Rhinovirus_Enterovirus_not_detected                    -4.9947        nan        nan        nan         nan         nan
Coronavirus_HKU1_detected                             -12.6400        nan        nan        nan         nan         nan
Coronavirus_HKU1_not_detected                           0.0662        nan        nan        nan         nan         nan
Parainfluenza_3_detected                              -13.7939   1.96e+07  -7.04e-07      1.000   -3.84e+07    3.84e+07
Parainfluenza_3_not_detected                            1.2200   1.96e+07   6.23e-08      1.000   -3.84e+07    3.84e+07
Chlamydophila_pneumoniae_detected                     -15.9385   2.34e+07  -6.81e-07      1.000   -4.59e+07    4.59e+07
Chlamydophila_pneumoniae_not_detected                   3.3647   2.34e+07   1.44e-07      1.000   -4.59e+07    4.59e+07
Adenovirus_detected                                   -12.1188        nan        nan        nan         nan         nan
Adenovirus_not_detected                                -0.4550        nan        nan        nan         nan         nan
Parainfluenza_4_detected                              -12.9639   1.53e+07  -8.47e-07      1.000      -3e+07       3e+07
Parainfluenza_4_not_detected                            0.3901   1.53e+07   2.55e-08      1.000      -3e+07       3e+07
Coronavirus229E_detected                              -16.1049   7.92e+06  -2.03e-06      1.000   -1.55e+07    1.55e+07
Coronavirus229E_not_detected                            3.5311   7.92e+06   4.46e-07      1.000   -1.55e+07    1.55e+07
CoronavirusOC43_detected                              -15.4096        nan        nan        nan         nan         nan
CoronavirusOC43_not_detected                            2.8358        nan        nan        nan         nan         nan
Inf_A_H1N1_2009_detected                              -17.9507   6.19e+06   -2.9e-06      1.000   -1.21e+07    1.21e+07
Inf_A_H1N1_2009_not_detected                            5.3769   6.19e+06   8.68e-07      1.000   -1.21e+07    1.21e+07
Bordetella_pertussis_detected                         -14.4267        nan        nan        nan         nan         nan
Bordetella_pertussis_not_detected                       1.8529        nan        nan        nan         nan         nan
Metapneumovirus_detected                              -15.9689        nan        nan        nan         nan         nan
Metapneumovirus_not_detected                            3.3951        nan        nan        nan         nan         nan
Parainfluenza_2_not_detected                          -12.5738        nan        nan        nan         nan         nan
Influenza_B__rapid_test_negative                       -5.5340   6.72e+07  -8.23e-08      1.000   -1.32e+08    1.32e+08
Influenza_B__rapid_test_positive                       -7.6429   6.72e+07  -1.14e-07      1.000   -1.32e+08    1.32e+08
Influenza_A__rapid_test_negative                        5.3317   6.72e+07   7.93e-08      1.000   -1.32e+08    1.32e+08
Influenza_A__rapid_test_positive                      -18.5085   6.72e+07  -2.75e-07      1.000   -1.32e+08    1.32e+08
=======================================================================================================================

Key Observations¶

  • The model performs consistently on the training and test sets.
  • The recall on the test set is 16.2%.
  • With the default threshold, the model gives low recall but good precision.
  • With the 0.16 threshold, the model gives a better balance of recall and precision.
  • Some features of value (p-value < 0.05) are:
  1. Patient_age_quantile
  2. Patient_admitted_to_regular_ward_1=yes__0=no
  3. Patient_admitted_to_intensive_care_unit_1=yes__0=no
  4. Lymphocytes
  5. Leukocytes

Decision Trees in detail¶

In [87]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [88]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [89]:
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[1, 0]):
    '''
    model : classifier used to predict on X_test
    y_actual : ground truth labels for X_test

    Note: X_test is taken from the enclosing (notebook) scope.
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [90]:
##  Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,flag=True):
    '''
    model : classifier to predict values of X

    '''
    # defining an empty list to store train and test results
    score_list=[] 
    
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    
    train_acc = model.score(X_train,y_train)
    test_acc = model.score(X_test,y_test)
    
    train_recall = metrics.recall_score(y_train,pred_train)
    test_recall = metrics.recall_score(y_test,pred_test)
    
    train_precision = metrics.precision_score(y_train,pred_train)
    test_precision = metrics.precision_score(y_test,pred_test)
    
    score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision))
        
    # If the flag is set to True, the following print statements will be displayed. The default value is True.
    if flag == True: 
        print("Accuracy on training set : ",model.score(X_train,y_train))
        print("Accuracy on test set : ",model.score(X_test,y_test))
        print("Recall on training set : ",metrics.recall_score(y_train,pred_train))
        print("Recall on test set : ",metrics.recall_score(y_test,pred_test))
        print("Precision on training set : ",metrics.precision_score(y_train,pred_train))
        print("Precision on test set : ",metrics.precision_score(y_test,pred_test))
    
    return score_list # returning the list with train and test scores
In [91]:
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
Out[91]:
DecisionTreeClassifier(random_state=1)
In [92]:
confusion_matrix_sklearn(model, X_train, y_train)
In [93]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
Out[93]:
Accuracy Recall Precision F1
0 0.918 0.169 1.000 0.290

Observation:¶

  1. The model was able to predict 65 true positive cases and 3559 true negative cases in the training set.
  2. It did not predict any false positive cases.
  3. It has an accuracy of about 91.8%, a recall of 16.9%, and a precision of 100%.
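Perfect training precision alongside much weaker test performance is a sign the unpruned tree memorizes the training data. A sketch on synthetic data of how limiting tree depth curbs this (the `max_depth` and `min_samples_leaf` values are illustrative, not tuned for this dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=20, random_state=1
).fit(X_tr, y_tr)

# The unpruned tree fits the training set perfectly; the pruned one cannot,
# which usually narrows the train/test gap
print(full.score(X_tr, y_tr), full.score(X_te, y_te))
print(pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```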
In [94]:
confusion_matrix_sklearn(model, X_test, y_test)
In [95]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test
Out[95]:
Accuracy Recall Precision F1
0 0.897 0.084 0.400 0.139

Observation:¶

In the test set;

  1. The model predicted 1502 true negative cases and 14 true positive cases.
  2. The model has an accuracy of about 89.7%, a recall of 8.4%, and a precision of 40%.
In [96]:
feature_names = list(X_train.columns)
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Observation:¶

The top 5 features of importance are:

  • Leukocytes
  • Patient_age_quantile
  • Patient_admitted_to_regular_ward_1=yes_0=no
  • Lymphocytes
  • MCV

Building bagging and boosting models¶

Bagging Model¶

In [97]:
#base_estimator for bagging classifier is a decision tree by default
bagging_estimator=BaggingClassifier(random_state=1)
bagging_estimator.fit(X_train,y_train)
Out[97]:
BaggingClassifier(random_state=1)
In [98]:
make_confusion_matrix(bagging_estimator,y_test)
In [99]:
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_estimator_score=get_metrics_score(bagging_estimator)
Accuracy on training set :  0.9170237631476431
Accuracy on test set :  0.9020070838252656
Recall on training set :  0.16929133858267717
Recall on test set :  0.07784431137724551
Precision on training set :  0.9555555555555556
Precision on test set :  0.52

Observation:¶

  • The bagging classifier's accuracy is similar on the train and test sets, but its test recall is very low (7.8%).

Random Forest Classifier in detail¶

In [100]:
#Train the random forest classifier
rf_estimator=RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
Out[100]:
RandomForestClassifier(random_state=1)
In [101]:
make_confusion_matrix(rf_estimator,y_test)
In [102]:
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_score=get_metrics_score(rf_estimator)
Accuracy on training set :  0.917802882742501
Accuracy on test set :  0.9061393152302243
Recall on training set :  0.17716535433070865
Recall on test set :  0.059880239520958084
Precision on training set :  0.9574468085106383
Precision on test set :  0.8333333333333334

Observation:¶

  • The random forest classifier's accuracy is similar on the train and test sets.
  • With default parameters, both the bagging and random forest classifiers achieve high accuracy, but their recall on the test set remains very low.

Hyperparameter Tuning¶

Bagging Classifier¶

In [103]:
# Choose the type of classifier. 
bagging_estimator_tuned = BaggingClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_samples': [0.7,0.8,0.9,1], 
              'max_features': [0.7,0.8,0.9,1],
              'n_estimators' : [10,20,30,40,50],
             }

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
Out[103]:
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=40,
                  random_state=1)
In [104]:
make_confusion_matrix(bagging_estimator_tuned, y_test)
In [105]:
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_estimator_tuned_score=get_metrics_score(bagging_estimator_tuned)
Accuracy on training set :  0.917802882742501
Accuracy on test set :  0.9020070838252656
Recall on training set :  0.16929133858267717
Recall on test set :  0.0658682634730539
Precision on training set :  1.0
Precision on test set :  0.5238095238095238

Observation:¶

The bagging classifier's performance is about the same after hyperparameter tuning; test recall remains below 8%.

Using logistic regression as the base estimator for bagging classifier¶

In [106]:
bagging_lr=BaggingClassifier(base_estimator=LogisticRegression(solver='liblinear',random_state=1,max_iter=1000),random_state=1)
bagging_lr.fit(X_train,y_train)
Out[106]:
BaggingClassifier(base_estimator=LogisticRegression(max_iter=1000,
                                                    random_state=1,
                                                    solver='liblinear'),
                  random_state=1)
In [107]:
make_confusion_matrix(bagging_lr,y_test)
In [108]:
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_lr_score=get_metrics_score(bagging_lr)
Accuracy on training set :  0.9045578496299181
Accuracy on test set :  0.9037780401416765
Recall on training set :  0.051181102362204724
Recall on test set :  0.059880239520958084
Precision on training set :  0.7647058823529411
Precision on test set :  0.625

Observation¶

  • The model performs consistently on the training and test sets; it is not overfitting.
  • Recall is approximately 6%.
  • It is far better at identifying negative cases than positive ones.

Random Forest Classifier¶

In [109]:
# Choose the type of classifier. 
rf_estimator_tuned = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
# NOTE: "n_estimators": [15,26,5] lists three literal values (np.arange(15, 26, 5) was
# likely intended), and an integer max_samples is an absolute sample count, so
# np.arange(5, 10, 5) == [5] trains every tree on only 5 rows.
parameters = {"n_estimators": [15,26,5],
    "min_samples_leaf": np.arange(5, 10),
    "max_features": ['sqrt', 'log2'],
    "max_samples": np.arange(5, 10, 5),
             }

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_estimator_tuned.fit(X_train, y_train)
Out[109]:
RandomForestClassifier(max_features='sqrt', max_samples=5, min_samples_leaf=5,
                       n_estimators=15, random_state=1)
In [110]:
make_confusion_matrix(rf_estimator_tuned,y_test)
In [111]:
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_tuned_score=get_metrics_score(rf_estimator_tuned)
Accuracy on training set :  0.901051811453058
Accuracy on test set :  0.9014167650531287
Recall on training set :  0.0
Recall on test set :  0.0
Precision on training set :  0.0
Precision on test set :  0.0

Observation:¶

Although this model is not overfitting, its precision and recall are both zero on the training and test sets, so it never predicts a positive case.
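The zero recall traces back to the grid: in scikit-learn's bagging and random forest estimators an integer max_samples is an absolute row count, not a fraction, so the searched value of 5 rows per tree cannot learn the minority class. A small sketch of the two semantics, using BaggingClassifier (which exposes the drawn samples; the parameter behaves the same way in RandomForestClassifier) on hypothetical synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X_demo, y_demo = make_classification(n_samples=1000, random_state=1)

# Integer max_samples: each base estimator is trained on exactly that many rows.
clf_count = BaggingClassifier(max_samples=5, random_state=1).fit(X_demo, y_demo)
print({len(s) for s in clf_count.estimators_samples_})  # {5}

# Float max_samples: a fraction of the training set (here 500 of 1000 rows).
clf_frac = BaggingClassifier(max_samples=0.5, random_state=1).fit(X_demo, y_demo)
print({len(s) for s in clf_frac.estimators_samples_})  # {500}
```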

Using class_weights for random forest¶

In [112]:
rf_wt = RandomForestClassifier(class_weight={0:0.4,1:0.6}, random_state=1)
rf_wt.fit(X_train,y_train)
Out[112]:
RandomForestClassifier(class_weight={0: 0.4, 1: 0.6}, random_state=1)
In [113]:
confusion_matrix_sklearn(rf_wt, X_test,y_test)
In [114]:
rf_wt_model_train_perf=model_performance_classification_sklearn(rf_wt, X_train,y_train)
print("Training performance \n",rf_wt_model_train_perf)
Training performance 
    Accuracy  Recall  Precision    F1
0     0.917   0.185      0.904 0.307
In [115]:
rf_wt_model_test_perf=model_performance_classification_sklearn(rf_wt, X_test,y_test)
print("Testing performance \n",rf_wt_model_test_perf)
Testing performance 
    Accuracy  Recall  Precision    F1
0     0.904   0.048      0.667 0.089

Observation¶

Relative to the tuned random forest, which predicted no positives at all, precision and recall have recovered in the class-weighted model, though test recall (about 5%) is still very low.
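The weights {0: 0.4, 1: 0.6} are a fairly gentle correction for a roughly 9:1 imbalance. An alternative worth comparing is class_weight='balanced', which scikit-learn derives directly from the label counts; with the 2,313/254 training split reported in the resampling section it weights positives about nine times as heavily as negatives:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels reconstructed from the reported training counts: 2,313 negatives, 254 positives.
y_demo = np.array([0] * 2313 + [1] * 254)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print({c: round(float(w), 3) for c, w in zip([0, 1], weights)})  # {0: 0.555, 1: 5.053}
```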

Importance of features¶

In [116]:
importances = rf_wt.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Observation:¶

The top 5 features in the class-weights random forest model are:

  • Patient_age_quantile
  • Leukocytes
  • Patient_admitted_to_regular_ward_1=yes_0=no
  • Hematocrit
  • Mean_platelet_volume

Boosting Models¶

The model will be boosted with:

  1. Adaboost
  2. Gradient boosting classifier
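For context on how these differ from bagging: instead of averaging independently trained trees, boosting fits learners sequentially, and in the classic discrete AdaBoost formulation each weak learner receives a vote of alpha = 0.5 * ln((1 - err) / err), while misclassified samples are up-weighted for the next round. A small numeric sketch of that stage weight (an illustration of the formula, not part of the modelling pipeline):

```python
import numpy as np

def adaboost_stage_weight(err):
    """Vote given to a weak learner in discrete AdaBoost, given its weighted error rate."""
    return 0.5 * np.log((1 - err) / err)

print(round(adaboost_stage_weight(0.30), 4))  # 0.4236 -- a decent stump gets a solid vote
print(round(adaboost_stage_weight(0.50), 4))  # 0.0    -- a coin-flip stump contributes nothing
```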

1. AdaBoost¶

In [117]:
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train,y_train)
Out[117]:
AdaBoostClassifier(random_state=1)
In [118]:
make_confusion_matrix(abc,y_test)
In [119]:
#Code to determine accuracy, recall and precision on train and test set
abc_score=get_metrics_score(abc)
Accuracy on training set :  0.9139072847682119
Accuracy on test set :  0.9031877213695395
Recall on training set :  0.14173228346456693
Recall on test set :  0.0718562874251497
Precision on training set :  0.9230769230769231
Precision on test set :  0.5714285714285714

2. Gradient Boosting Classifier¶

In [120]:
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train,y_train)
Out[120]:
GradientBoostingClassifier(random_state=1)
In [121]:
make_confusion_matrix(gbc,y_test)
In [122]:
#Determining accuracy, recall and precision on train and test set
gbc_score=get_metrics_score(gbc)
Accuracy on training set :  0.9170237631476431
Accuracy on test set :  0.9014167650531287
Recall on training set :  0.16535433070866143
Recall on test set :  0.07784431137724551
Precision on training set :  0.9767441860465116
Precision on test set :  0.5

Observations:¶

  • Both boosting models generalise consistently from the training to the test set under default parameters.
  • Test recall (7–8%) is comparable to the bagging models and still far too low for case detection.
  • Test precision (50–57%) is moderate and not yet ideal.

Hyperparameter Tuning¶

1. AdaBoost Classifier¶

In [123]:
# Choose the type of classifier. 
abc_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    #Let's try different max_depth for base_estimator
    "base_estimator":[DecisionTreeClassifier(max_depth=1, random_state=1),DecisionTreeClassifier(max_depth=2, random_state=1),DecisionTreeClassifier(max_depth=3, random_state=1)],
    "n_estimators": np.arange(15,26,5),
    "learning_rate":np.arange(0.1,2,0.1)
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
Out[123]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.8, n_estimators=15, random_state=1)
In [124]:
make_confusion_matrix(abc_tuned,y_test)
In [125]:
#Using above defined function to get accuracy, recall and precision on train and test set
abc_tuned_score=get_metrics_score(abc_tuned)
Accuracy on training set :  0.9170237631476431
Accuracy on test set :  0.9020070838252656
Recall on training set :  0.16535433070866143
Recall on test set :  0.04790419161676647
Precision on training set :  0.9767441860465116
Precision on test set :  0.5333333333333333

2. Gradient Boosting Classifier (Hyperparameter tuning)¶

In [126]:
#Using AdaBoost classifier as the estimator for initial predictions
gbc_init = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
gbc_init.fit(X_train,y_train)
Out[126]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           random_state=1)
In [127]:
gbc_init_score=get_metrics_score(gbc_init)
Accuracy on training set :  0.9166342033502143
Accuracy on test set :  0.9020070838252656
Recall on training set :  0.16141732283464566
Recall on test set :  0.07784431137724551
Precision on training set :  0.9761904761904762
Precision on test set :  0.52
In [128]:
# Choose the type of classifier. 
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [15,26,5],
    "subsample":[0.8,0.9,1],
    "max_features":[0.7,0.8,0.9,1]
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
Out[128]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.9, n_estimators=26, random_state=1,
                           subsample=1)
In [129]:
make_confusion_matrix(gbc_tuned,y_test)
In [130]:
#Accuracy, recall and precision on train and test set
gbc_tuned_score=get_metrics_score(gbc_tuned)
Accuracy on training set :  0.9123490455784963
Accuracy on test set :  0.9025974025974026
Recall on training set :  0.11811023622047244
Recall on test set :  0.0718562874251497
Precision on training set :  0.967741935483871
Precision on test set :  0.5454545454545454

Observations:¶

  • After hyperparameter tuning, the performance metrics of the models have not changed significantly.
  • The models perform well on both training and test sets.
  • Tuned AdaBoost has better training recall than the tuned gradient boosting classifier, but slightly lower recall on the test set.

Feature ranking¶

In [131]:
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Observation:¶

The top 5 features after hyperparameter tuning are as follows:

  • Leukocytes
  • Patient_admitted_to_regular_ward_1=yes_0=no
  • Patient_age_quantile
  • Lymphocytes
  • Red_blood_cells

Comparing bagging models¶

In [132]:
# defining list of models
models = [bagging_estimator,bagging_estimator_tuned,bagging_lr,rf_estimator,rf_estimator_tuned,
          rf_wt]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []

# looping through all the models to get the accuracy, precall and precision scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    recall_train.append(np.round(j[2],2))
    recall_test.append(np.round(j[3],2))
    precision_train.append(np.round(j[4],2))
    precision_test.append(np.round(j[5],2))
In [133]:
comparison_frame = pd.DataFrame({'Model':['Bagging classifier with default parameters','Tuned Bagging Classifier',
                                        'Bagging classifier with base_estimator=LR', 'Random Forest with default parameters',
                                         'Tuned Random Forest Classifier','Random Forest with class_weights'], 
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test}) 
comparison_frame
Out[133]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision
0 Bagging classifier with default parameters 0.920 0.900 0.170 0.080 0.960 0.520
1 Tuned Bagging Classifier 0.920 0.900 0.170 0.070 1.000 0.520
2 Bagging classifier with base_estimator=LR 0.900 0.900 0.050 0.060 0.760 0.620
3 Random Forest with default parameters 0.920 0.910 0.180 0.060 0.960 0.830
4 Tuned Random Forest Classifier 0.900 0.900 0.000 0.000 0.000 0.000
5 Random Forest with class_weights 0.920 0.900 0.190 0.050 0.900 0.670

Comparing boosting models¶

In [134]:
# defining list of models
models = [abc, abc_tuned, gbc, gbc_init, gbc_tuned]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []

# looping through all the models to get the accuracy, precall and precision scores
for model in models:
    j = get_metrics_score(model,False)
    acc_train.append(np.round(j[0],2))
    acc_test.append(np.round(j[1],2))
    recall_train.append(np.round(j[2],2))
    recall_test.append(np.round(j[3],2))
    precision_train.append(np.round(j[4],2))
    precision_test.append(np.round(j[5],2))
In [135]:
comparison_frame = pd.DataFrame({'Model':['AdaBoost with default parameters','AdaBoost Tuned', 
                                          'Gradient Boosting with default parameters','Gradient Boosting with init=AdaBoost',
                                          'Gradient Boosting Tuned'], 
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test}) 
comparison_frame
Out[135]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision
0 AdaBoost with default parameters 0.910 0.900 0.140 0.070 0.920 0.570
1 AdaBoost Tuned 0.920 0.900 0.170 0.050 0.980 0.530
2 Gradient Boosting with default parameters 0.920 0.900 0.170 0.080 0.980 0.500
3 Gradient Boosting with init=AdaBoost 0.920 0.900 0.160 0.080 0.980 0.520
4 Gradient Boosting Tuned 0.910 0.900 0.120 0.070 0.970 0.550

Observations:¶

  • None of the bagging or boosting ensembles overfit.
  • Recall was poor across all of the ensemble models; the boosted models were only slightly better.

Conclusion¶

  • As they stand, the ensemble models cannot reliably identify Covid-19-positive patients, so they cannot yet drive targeted confirmatory testing.
  • The models will next be rebuilt after oversampling/undersampling the training data to correct the class imbalance.

Model Building with Oversampled data¶

In [136]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
In [137]:
print("Before OverSampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, count of label '0': {} \n".format(sum(y_train == 0)))

print("After OverSampling, count of label '1': {}".format(sum(y_train_over == 1)))
print("After OverSampling, count of label '0': {} \n".format(sum(y_train_over == 0)))

print("After OverSampling, the shape of train_X: {}".format(X_train_over.shape))
print("After OverSampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before OverSampling, count of label '1': 254
Before OverSampling, count of label '0': 2313 

After OverSampling, count of label '1': 2313
After OverSampling, count of label '0': 2313 

After OverSampling, the shape of train_X: (4626, 51)
After OverSampling, the shape of train_y: (4626,) 
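For reference, SMOTE does not duplicate the 254 positive rows; it synthesises new minority samples by interpolating between a minority point and one of its k nearest minority neighbours. A stripped-down sketch of that interpolation step with hypothetical values (the real imbalanced-learn implementation also handles the neighbour search and edge cases):

```python
import numpy as np

rng = np.random.default_rng(1)

x_i = np.array([1.0, 2.0])   # a minority-class sample (hypothetical values)
x_nn = np.array([3.0, 6.0])  # one of its k nearest minority neighbours

# SMOTE-style synthetic point: step a random fraction of the way towards the neighbour.
gap = rng.random()                   # uniform in [0, 1)
synthetic = x_i + gap * (x_nn - x_i)
print(synthetic)  # always lies on the segment between x_i and x_nn
```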

In [138]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")

for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Training Performance:" "\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over)) * 100
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 85.82059409273232
Random forest: 87.72114854188288
GBM: 89.5360578945892
Adaboost: 86.42226024515442
dtree: 85.69053696483502

Training Performance:

Bagging: 88.49978383052313
Random forest: 89.06182447038478
GBM: 92.95287505404237
Adaboost: 87.8945092952875
dtree: 88.84565499351491
In [139]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()

Observation¶

The gradient boosting model (GBM) is the best-performing model for the oversampled data.

Model Building with Undersampled data¶

In [140]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [141]:
print("Before Under Sampling, count of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, count of label '0': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, count of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, count of label '0': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, count of label '1': 254
Before Under Sampling, count of label '0': 2313 

After Under Sampling, count of label '1': 254
After Under Sampling, count of label '0': 254 

After Under Sampling, the shape of train_X: (508, 51)
After Under Sampling, the shape of train_y: (508,) 

In [142]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")

for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Training Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un)) * 100
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Bagging: 70.47058823529412
Random forest: 69.65490196078431
GBM: 76.7686274509804
Adaboost: 79.12941176470589
dtree: 65.31764705882352

Training Performance:

Bagging: 75.59055118110236
Random forest: 80.70866141732283
GBM: 85.43307086614173
Adaboost: 92.91338582677166
dtree: 77.55905511811024
In [143]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()

Observation:¶

AdaBoost is the best-performing model for the undersampled data.

Sample tuning method for Decision tree with oversampled data¶

In [144]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
In [145]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=10,
    n_jobs=-1,
    scoring=scorer,
    cv=5,
    random_state=1,
)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 2} with CV score=0.9982712032388058:
In [146]:
# Building the tuned tree (note: these values were set manually and differ from the best
# parameters reported by the randomized search above)
dt_tuned = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=1, max_leaf_nodes=15, min_impurity_decrease=0.001,
)

# Fit the best algorithm to the data.
dt_tuned.fit(X_train_over, y_train_over)
Out[146]:
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=15,
                       min_impurity_decrease=0.001)
In [147]:
# creating confusion matrix
confusion_matrix_sklearn(dt_tuned, X_train_over, y_train_over)
In [148]:
# Calculating different metrics on train set
dt_random_train = model_performance_classification_sklearn(
    dt_tuned, X_train_over, y_train_over
)
print("Training performance:")
dt_random_train
Training performance:
Out[148]:
Accuracy Recall Precision F1
0 0.628 0.961 0.577 0.721

Observation:¶

Recall has improved in this model.

Tuning Random Forest with Randomized Search (Oversampled data)¶

In [149]:
%%time

# Choose the type of classifier. 
rf2 = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {"n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    # NOTE: likely intended as list(np.arange(0.3, 0.6, 0.1)) + ['sqrt']; as written the
    # whole array is passed to the search as a single candidate value for max_features
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], 
    "max_samples": np.arange(0.4, 0.7, 0.1),
    "max_depth":np.arange(3,4,5),  # step exceeds the range, so this is just [3]
    "class_weight" : ['balanced', 'balanced_subsample'],
    "min_impurity_decrease":[0.001, 0.002, 0.003]
             }

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the random search
grid_obj = RandomizedSearchCV(rf2, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10

grid_obj = grid_obj.fit(X_train_over, y_train_over)

# Print the best combination of parameters
grid_obj.best_params_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
CPU times: user 1.55 s, sys: 82.1 ms, total: 1.63 s
Wall time: 34.9 s
Out[149]:
{'n_estimators': 250,
 'min_samples_leaf': 1,
 'min_impurity_decrease': 0.001,
 'max_samples': 0.6,
 'max_features': 'sqrt',
 'max_depth': 3,
 'class_weight': 'balanced_subsample'}
In [150]:
# Build the model with chosen parameters (note: class_weight, max_samples, min_samples_leaf
# and n_estimators differ slightly from the search's best values above)
rf2_tuned = RandomForestClassifier(
    class_weight="balanced",
    max_features="sqrt",
    max_samples=0.5,
    min_samples_leaf=2,
    n_estimators=200,
    random_state=1,
    max_depth=3,
    min_impurity_decrease=0.001,
)

# Fit the best algorithm to the data.
rf2_tuned.fit(X_train_over, y_train_over)
Out[150]:
RandomForestClassifier(class_weight='balanced', max_depth=3,
                       max_features='sqrt', max_samples=0.5,
                       min_impurity_decrease=0.001, min_samples_leaf=2,
                       n_estimators=200, random_state=1)
In [151]:
# creating confusion matrix
confusion_matrix_sklearn(rf2_tuned, X_train_over, y_train_over)
In [152]:
# Calculating different metrics on train set
rf2_random_train = model_performance_classification_sklearn(
    rf2_tuned, X_train_over, y_train_over
)
print("Training performance:")
rf2_random_train
Training performance:
Out[152]:
Accuracy Recall Precision F1
0 0.649 0.889 0.601 0.717

Observation:¶

Recall on the training set has improved for the random forest.

Tuning Adaboost with Randomized Search (Oversampled data)¶

In [153]:
%%time 

# defining model
model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV

param_grid = {
    "n_estimators": np.arange(100, 150, 200),  # step exceeds the range, so this is just [100]
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)} with CV score=0.9719072863781287:
CPU times: user 632 ms, sys: 24.6 ms, total: 656 ms
Wall time: 11.9 s
In [154]:
# building model with best parameters (note: base_estimator depth 3 differs from the
# search's best of depth 1)
adb_tuned2 = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.05,
    random_state=1,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)

# Fit the model on training data
adb_tuned2.fit(X_train_over, y_train_over)
Out[154]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.05, n_estimators=100, random_state=1)
In [155]:
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2,  X_train_over, y_train_over)
In [156]:
# Calculating different metrics on train set
Adaboost_random_train = model_performance_classification_sklearn(
    adb_tuned2, X_train_over, y_train_over
)
print("Training performance:")
Adaboost_random_train
Training performance:
Out[156]:
Accuracy Recall Precision F1
0 0.704 0.899 0.647 0.752

Observation¶

Recall has also improved in this model

Tuning GBM with Gridsearch (Oversampled data)¶

In [157]:
# Choose the type of classifier. 
gbc_tuned_1= GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [15,26,5],
    "subsample":[0.8,0.9,1],
    "max_features":[0.7,0.8,0.9,1]
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned_1, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train_over, y_train_over)

# Set the clf to the best combination of parameters
gbc_tuned_1 = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned_1.fit(X_train_over, y_train_over)
Out[157]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.8, n_estimators=5, random_state=1,
                           subsample=0.9)
In [158]:
# creating confusion matrix
confusion_matrix_sklearn(gbc_tuned_1, X_train_over, y_train_over)
In [159]:
gbc_random_train1 = model_performance_classification_sklearn(
    gbc_tuned_1, X_train_over, y_train_over
)
print("Training performance:")
gbc_random_train1
Training performance:
Out[159]:
Accuracy Recall Precision F1
0 0.648 0.975 0.589 0.735

Sample tuning method for Decision tree with undersampled data¶

In [160]:
# Build the tuned tree (parameters set manually; no search is shown for the undersampled data)
dt1_tuned = DecisionTreeClassifier(
    max_depth=6, min_samples_leaf=7, max_leaf_nodes=15, min_impurity_decrease=0.001,
)

# Fit the best algorithm to the data.
dt1_tuned.fit(X_train_un, y_train_un)
Out[160]:
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, min_samples_leaf=7)
In [161]:
# creating confusion matrix
confusion_matrix_sklearn(dt1_tuned, X_train_un, y_train_un)
In [162]:
# Calculating different metrics on validation set
dt1_random_train = model_performance_classification_sklearn(dt1_tuned, X_train_un, y_train_un)
print("Training performance:")
dt1_random_train
Training performance:
Out[162]:
Accuracy Recall Precision F1
0 0.634 0.874 0.590 0.705

Tuning Random forests with Randomized Search (Undersampled data)¶

In [163]:
%%time

# Choose the type of classifier. 
rf2 = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {"n_estimators": [200,250,300],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],  # likely intended: list(np.arange(0.3, 0.6, 0.1)) + ['sqrt']
    "max_samples": np.arange(0.4, 0.7, 0.1),
    "max_depth":np.arange(3,4,5),  # step exceeds the range, so this is just [3]
    "class_weight" : ['balanced', 'balanced_subsample'],
    "min_impurity_decrease":[0.001, 0.002, 0.003]
             }

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the random search
grid_obj = RandomizedSearchCV(rf2, parameters,n_iter=30, scoring=acc_scorer,cv=5, random_state = 1, n_jobs = -1, verbose = 2)
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10

grid_obj = grid_obj.fit(X_train_un, y_train_un)

# Print the best combination of parameters
grid_obj.best_params_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
CPU times: user 746 ms, sys: 44.7 ms, total: 790 ms
Wall time: 27.8 s
Out[163]:
{'n_estimators': 200,
 'min_samples_leaf': 2,
 'min_impurity_decrease': 0.001,
 'max_samples': 0.5,
 'max_features': 'sqrt',
 'max_depth': 3,
 'class_weight': 'balanced'}
In [164]:
# Build the model with chosen parameters (note: max_samples and min_samples_leaf differ
# from the search's best values above)
rf2_tuned = RandomForestClassifier(
    class_weight="balanced",
    max_features="sqrt",
    max_samples=0.6,
    min_samples_leaf=1,
    n_estimators=200,
    random_state=1,
    max_depth=3,
    min_impurity_decrease=0.001,
)

# Fit the best algorithm to the data.
rf2_tuned.fit(X_train_un, y_train_un)
Out[164]:
RandomForestClassifier(class_weight='balanced', max_depth=3,
                       max_features='sqrt', max_samples=0.6,
                       min_impurity_decrease=0.001, n_estimators=200,
                       random_state=1)
In [165]:
# creating confusion matrix
confusion_matrix_sklearn(rf2_tuned, X_train_un, y_train_un)
In [166]:
# Calculating different metrics on train set
rf2_random_train = model_performance_classification_sklearn(
    rf2_tuned, X_train_un, y_train_un
)
print("Training performance:")
rf2_random_train
Training performance:
Out[166]:
Accuracy Recall Precision F1
0 0.646 0.858 0.602 0.708

Tuning Adaboost with Randomized Search (Undersampled data)¶

In [167]:
%%time 

# defining model
model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in GridSearchCV

param_grid = {
    # NB: np.arange(100, 150, 200) yields only [100]; [100, 150, 200] was likely intended
    "n_estimators": np.arange(100, 150, 200),
    "learning_rate": [0.2, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=1, random_state=1)} with CV score=0.8936470588235295:
CPU times: user 258 ms, sys: 11.5 ms, total: 270 ms
Wall time: 3.49 s
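Retyping the winning hyperparameters into a fresh model (as in the next cell) invites transcription drift between `best_params_` and the model that actually gets evaluated. A minimal sketch on synthetic data, with the hypothetical names `X_demo` and `adb_best`, of reusing the search's already-refit `best_estimator_` instead:

```python
# Sketch: take the refit best estimator straight from the search object
# rather than copying hyperparameter values by hand.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# small synthetic binary-classification problem (hypothetical stand-in data)
X_demo, y_demo = make_classification(n_samples=200, random_state=1)

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1),
    {"n_estimators": [25, 50], "learning_rate": [0.05, 0.2]},
    n_iter=3,
    scoring="recall",
    cv=3,
    random_state=1,
)
search.fit(X_demo, y_demo)

# best_estimator_ is already refit on the full training data with the
# winning parameters, so no values need to be transcribed
adb_best = search.best_estimator_
tuned_lr = adb_best.get_params()["learning_rate"]
```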
In [168]:
# building model with the chosen parameters
# (NB: learning_rate=0.2 and max_depth=3 differ from the search's best
# parameters, which were learning_rate=0.05 and max_depth=1)
adb_tuned2 = AdaBoostClassifier(
    n_estimators=100,
    learning_rate=0.2,
    random_state=1,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)

# Fit the model on training data
adb_tuned2.fit(X_train_un, y_train_un)
Out[168]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.2, n_estimators=100, random_state=1)
In [169]:
# creating confusion matrix
confusion_matrix_sklearn(adb_tuned2, X_train_un, y_train_un)
In [170]:
Adaboost_random_train = model_performance_classification_sklearn(
    adb_tuned2, X_train_un, y_train_un
)
print("Training performance:")
Adaboost_random_train
Training performance:
Out[170]:
Accuracy Recall Precision F1
0 0.742 0.819 0.710 0.761

Observation:¶

The models achieve noticeably better recall and precision on the undersampled data than on the original data.

Tuning GBM with Gridsearch (Undersampled data)¶

In [171]:
# Choose the type of classifier. 
gbc_tuned_2= GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [15,26,5],
    "subsample":[0.8,0.9,1],
    "max_features":[0.7,0.8,0.9,1]
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned_2, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_un, y_train_un)

# Set the clf to the best combination of parameters
gbc_tuned_2 = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned_2.fit(X_train_un, y_train_un)
Out[171]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=5, random_state=1,
                           subsample=0.9)
In [172]:
# creating confusion matrix
confusion_matrix_sklearn(gbc_tuned_2, X_train_un, y_train_un)
In [173]:
gbc_random_train2 = model_performance_classification_sklearn(
    gbc_tuned_2, X_train_un, y_train_un
)
print("Training performance:")
gbc_random_train2
Training performance:
Out[173]:
Accuracy Recall Precision F1
0 0.661 0.933 0.605 0.734

Observation:¶

The tuned GBM likewise achieves better recall and precision on the undersampled data than on the original data.

Model performance comparison for oversampled and undersampled data training set¶

In [174]:
# training performance comparison
# (NB: rf2_random_train and Adaboost_random_train were computed on the
# undersampled data, so the "Oversampled" RF and AdaBoost columns below
# repeat the undersampled scores)

models_train_comp_df = pd.concat(
    [
        dt_random_train.T,
        rf2_random_train.T,
        Adaboost_random_train.T,
        dt1_random_train.T,
        rf2_random_train.T,
        Adaboost_random_train.T,
        gbc_random_train1.T,
        gbc_random_train2.T,
        
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Tuned DTree oversampled",
    "Random forest Oversampled",
    "AdaBoost Tuned with Random search",
    "Tuned DTree undersampled",
    "Random forest undersampled",
    "Adaboost tuned with Random Search undersampled",
    "GBM tuned with oversampled data",
    "GBM tuned with undersampled data"
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[174]:
Tuned DTree oversampled Random forest Oversampled AdaBoost Tuned with Random search Tuned DTree undersampled Random forest undersampled Adaboost tuned with Random Search undersampled GBM tuned with oversampled data GBM tuned with undersampled data
Accuracy 0.628 0.646 0.742 0.634 0.646 0.742 0.648 0.661
Recall 0.961 0.858 0.819 0.874 0.858 0.819 0.975 0.933
Precision 0.577 0.602 0.710 0.590 0.602 0.710 0.589 0.605
F1 0.721 0.708 0.761 0.705 0.708 0.761 0.735 0.734

Observation:¶

The tuned gradient boosting model has the best recall for both the oversampled and undersampled training sets.

Model Performance on Validation set¶

In [175]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))

results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation Performance:" "\n")

for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val)) * 100
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic regression: 63.37254901960784
Bagging: 70.47058823529412
Random forest: 69.65490196078431
GBM: 76.7686274509804
Adaboost: 79.12941176470589
dtree: 65.31764705882352

Validation Performance:

Logistic regression: 63.503649635036496
Bagging: 62.04379562043796
Random forest: 66.42335766423358
GBM: 73.72262773722628
Adaboost: 86.86131386861314
dtree: 56.934306569343065
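The loop above reports only the mean cross-validated recall; the fold-to-fold spread helps judge whether close scores (e.g. GBM vs. Adaboost) differ by more than noise. A minimal sketch on synthetic data, using the hypothetical names `X_demo` and `scores`, of reporting mean and standard deviation together:

```python
# Sketch: report the mean and the fold-to-fold spread of CV recall.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# small synthetic binary-classification problem (hypothetical stand-in data)
X_demo, y_demo = make_classification(n_samples=300, random_state=1)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    AdaBoostClassifier(random_state=1),
    X_demo,
    y_demo,
    scoring="recall",
    cv=kfold,
)
# one recall value per fold; the std quantifies the noise around the mean
print(f"recall: {scores.mean():.3f} +/- {scores.std():.3f}")
```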

Performance of Tuned Models on Validation Set¶

In [176]:
# Calculating different metrics on validation set
dt1_random_val = model_performance_classification_sklearn(dt1_tuned, X_val, y_val)
print("Validation performance:")
dt1_random_val
Validation performance:
Out[176]:
Accuracy Recall Precision F1
0 0.408 0.832 0.125 0.218
In [177]:
# Calculating different metrics on validation set
rf2_random_val = model_performance_classification_sklearn(
    rf2_tuned, X_val, y_val
)
print("Validation performance:")
rf2_random_val
Validation performance:
Out[177]:
Accuracy Recall Precision F1
0 0.415 0.810 0.124 0.215
In [178]:
# Calculating different metrics on validation set
Adaboost_random_val = model_performance_classification_sklearn(
    adb_tuned2, X_val, y_val
)
print("Validation performance:")
Adaboost_random_val
Validation performance:
Out[178]:
Accuracy Recall Precision F1
0 0.524 0.635 0.125 0.209
In [179]:
gbc_random_val = model_performance_classification_sklearn(
    gbc_tuned_2, X_val, y_val
)
print("Validation performance:")
gbc_random_val
Validation performance:
Out[179]:
Accuracy Recall Precision F1
0 0.390 0.905 0.130 0.227
In [180]:
# validation performance comparison

models_val_comp_df = pd.concat(
    [
        dt1_random_val.T,
        rf2_random_val.T,
        Adaboost_random_val.T,
        gbc_random_val.T,
        
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Tuned DTree val",
    "Random forest val",
    "AdaBoost Tuned with Random search val",
    "GBM tuned val"
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
Out[180]:
Tuned DTree val Random forest val AdaBoost Tuned with Random search val GBM tuned val
Accuracy 0.408 0.415 0.524 0.390
Recall 0.832 0.810 0.635 0.905
Precision 0.125 0.124 0.125 0.130
F1 0.218 0.215 0.209 0.227

Comment¶

  • The tuned Gradient Boosting model has the best recall (90.5%) on the validation set.
  • The tuned Adaboost model has the worst recall (63.5%).
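Since validation precision hovers around 0.12-0.13 for all four models, one option (not part of the original analysis) is to move away from the default 0.5 probability cut-off: keep recall above a target and pick the threshold that maximizes precision. A minimal sketch on synthetic data, with the hypothetical names `X_demo`, `best_t`, and a 0.9 recall target:

```python
# Sketch: choose a probability threshold from the precision-recall curve
# that keeps recall >= 0.9 while recovering as much precision as possible.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# imbalanced synthetic problem (hypothetical stand-in data)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.8], random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=1
)

model = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_va, proba)
# the last precision/recall point has no threshold, so drop it when masking
ok = recall[:-1] >= 0.9
best_t = thresholds[ok][np.argmax(precision[:-1][ok])]
y_pred = (proba >= best_t).astype(int)
```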

Test set final performance¶

In [181]:
# Calculating different metrics on test set
dt1_random_test = model_performance_classification_sklearn(dt1_tuned, X_test, y_test)
print("Test performance:")
dt1_random_test
Test performance:
Out[181]:
Accuracy Recall Precision F1
0 0.402 0.868 0.128 0.223
In [182]:
# Calculating different metrics on test set
rf2_random_test = model_performance_classification_sklearn(
    rf2_tuned, X_test, y_test
)
print("Test performance:")
rf2_random_test
Test performance:
Out[182]:
Accuracy Recall Precision F1
0 0.416 0.826 0.126 0.218
In [183]:
# Calculating different metrics on test set
Adaboost_random_test = model_performance_classification_sklearn(
    adb_tuned2, X_test, y_test
)
print("Test performance:")
Adaboost_random_test
Test performance:
Out[183]:
Accuracy Recall Precision F1
0 0.537 0.677 0.134 0.224
In [184]:
gbc_random_test = model_performance_classification_sklearn(
    gbc_tuned_2, X_test, y_test
)
print("Test performance:")
gbc_random_test
Test performance:
Out[184]:
Accuracy Recall Precision F1
0 0.384 0.856 0.123 0.215
In [185]:
# test performance comparison

models_test_comp_df = pd.concat(
    [
        dt1_random_test.T,
        rf2_random_test.T,
        Adaboost_random_test.T,
        gbc_random_test.T,
        
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Tuned DTree test",
    "Random forest test",
    "AdaBoost Tuned with Random search test",
    "GBM tuned test"
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[185]:
Tuned DTree test Random forest test AdaBoost Tuned with Random search test GBM tuned test
Accuracy 0.402 0.416 0.537 0.384
Recall 0.868 0.826 0.677 0.856
Precision 0.128 0.126 0.134 0.123
F1 0.223 0.218 0.224 0.215

Observation:¶

  • The models are slightly overfitting.
  • Accuracy and precision are low for all four models, but recall is good.
  • The tuned decision tree model is preferred because it has the highest recall (86.8%) on the test set.
  • The tuned GBM has the second-best recall (85.6%).
  • The ability of a screening algorithm to pick up positive Covid-19 cases is invaluable; hence recall is the desired property of the final model.
In [186]:
feature_names = X.columns
importances = dt1_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
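Complementing the bar chart, the same importances can be printed as a sorted table, which is easier to quote in a report. A minimal sketch on synthetic data, using hypothetical names such as `X_demo` and `top5`:

```python
# Sketch: turn feature_importances_ into a sorted Series and print the top 5.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# small synthetic problem with named columns (hypothetical stand-in data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
cols = [f"feat_{i}" for i in range(8)]

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

# importances sum to 1; sorting descending puts the strongest feature first
imp = pd.Series(tree.feature_importances_, index=cols).sort_values(ascending=False)
top5 = imp.head(5)
print(top5)
```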
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   1.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.4s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.7s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.5s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time=   2.1s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time=   1.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time=   1.7s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   2.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   1.8s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   1.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time=   1.6s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.4s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time=   1.8s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.5s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.9s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   1.4s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   2.4s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.6s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.1s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   0.7s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time=   0.8s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time=   0.7s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time=   1.1s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time=   1.0s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time=   0.8s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.1s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.1s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.7s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time=   2.0s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   0.8s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time=   0.8s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time=   1.7s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   2.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time=   1.4s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.4s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time=   1.9s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.4s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.9s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.9s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   1.5s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   2.4s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.6s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   1.1s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=200; total time=   0.6s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=200; total time=   0.8s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=2, n_estimators=300; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.001, min_samples_leaf=3, n_estimators=300; total time=   1.3s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time=   0.8s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=200; total time=   1.1s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=1, n_estimators=200; total time=   0.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=3, n_estimators=300; total time=   1.2s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=300; total time=   1.1s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.7s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.003, min_samples_leaf=3, n_estimators=200; total time=   0.6s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=sqrt, max_samples=0.5, min_impurity_decrease=0.002, min_samples_leaf=2, n_estimators=300; total time=   1.0s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.6, min_impurity_decrease=0.003, min_samples_leaf=2, n_estimators=250; total time=   0.2s
[CV] END class_weight=balanced_subsample, max_depth=3, max_features=sqrt, max_samples=0.6, min_impurity_decrease=0.001, min_samples_leaf=1, n_estimators=250; total time=   1.0s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.3s
[CV] END class_weight=balanced, max_depth=3, max_features=[0.3 0.4 0.5], max_samples=0.4, min_impurity_decrease=0.002, min_samples_leaf=1, n_estimators=300; total time=   0.2s

Interpretation of the best model¶

  • As mentioned earlier, the ideal model should be able to identify positive Covid-19 cases when people with flu-like symptoms present at a hospital.
  • The metric that best captures this property of any analytic model is recall.
  • The tuned decision tree model has a recall of 0.87.
  • This means that it will correctly flag positive cases 87% of the time; in other words, it will detect 87% of all positive Covid-19 cases.
  • For the purposes of our project, this makes it the best model.
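As a quick illustration of how recall captures this property (using made-up labels, not the project's data):

```python
# Illustrative sketch: recall is the share of true positive Covid-19 cases
# that a model catches. Labels and predictions below are hypothetical.
from sklearn.metrics import recall_score, confusion_matrix

# 1 = positive Covid-19 test, 0 = negative (made-up example)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)  # equals tp / (tp + fn)
print(f"Recall: {recall:.2f}")  # 4 of 5 positives caught -> 0.80
```

A recall of 0.87 therefore means that, out of every 100 truly positive patients screened, about 87 are expected to be flagged by the model.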

Business Insights And Recommendations¶

  1. There are few positive Covid-19 cases and a large number of negative cases in the dataset. This imbalance can affect modeling when the split into training and test sets is done randomly. Stratifying the split and optimizing the models helps achieve good performance metrics.
  2. Oversampling and undersampling techniques are employed to enable adequate model training.
  3. Variables such as the patient's age, leukocyte count, ward admission status and hematocrit are the most important in predicting positive Covid-19 cases.
  4. The institution can therefore effectively manage the burden of extensive Covid-19 testing in under-resourced hospitals by assessing and investing in the above-mentioned variables in patients who present with flu-like illnesses.
  5. The tuned decision tree model is the best predictive model, with the highest sensitivity (recall), enabling effective screening of Covid-19 cases. It correctly identifies positive Covid-19 cases about 87% of the time.
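Insights 1 and 2 above can be sketched as follows on synthetic data (all names and numbers here are illustrative, not taken from the project's dataset):

```python
# Sketch: a stratified train/test split preserves the rare-positive class
# ratio, and simple random oversampling rebalances the training set.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))               # stand-in laboratory features
y = (rng.random(1000) < 0.10).astype(int)    # ~10% positive, mimicking class imbalance

# stratify=y keeps the positive rate equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
print(f"train positive rate: {y_tr.mean():.2f}, test positive rate: {y_te.mean():.2f}")

# Random oversampling of the minority class in the training set only
pos = np.flatnonzero(y_tr == 1)
neg = np.flatnonzero(y_tr == 0)
boost = rng.choice(pos, size=len(neg) - len(pos), replace=True)
X_bal = np.vstack([X_tr, X_tr[boost]])
y_bal = np.concatenate([y_tr, y_tr[boost]])
print(f"balanced positive rate: {y_bal.mean():.2f}")
```

Oversampling is applied only to the training set so that the test set still reflects the real-world class ratio the screening tool will face.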

Executive Summary¶

Key takeaways:¶

  • The dataset, in its raw form, needed to undergo processing before comprehensive predictive models could be built.
  • The dataset was fairly large, consisting of data from 5644 unique patients.
  • Working with this data required a great deal of meticulousness and diligence.
  • Being a fairly new disease, Covid-19 is not yet fully understood, and predicting its presence from patient parameters other than PCR testing is no easy task.
  • By employing machine learning techniques, a predictive model has been identified to serve as a screening tool that will assist the organization in effectively allocating its resources to fight the Covid-19 pandemic.

Final model selection:¶

  • The final model was chosen after rigorously evaluating the dataset, carrying out data pre-processing, and subsequently building, training and testing several models.
  • The model with the highest recall was finally selected to serve as a sensitive screening tool.
  • Metrics such as accuracy and precision were also evaluated for each model.
  • Expectations of the final model:
  1. It should have a good recall: the ability to accurately predict positive cases.
  2. It should generalize well on both the train and test sets.
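The generalization expectation above can be checked, for example, by comparing recall on the train and test sets; the sketch below uses synthetic data and an illustrative tree, not the project's fitted model:

```python
# Sketch of a generalization check: a large gap between train and test
# recall would suggest overfitting. Data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=800, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = DecisionTreeClassifier(
    max_depth=3, class_weight="balanced", random_state=0
).fit(X_tr, y_tr)

train_recall = recall_score(y_tr, model.predict(X_tr))
test_recall = recall_score(y_te, model.predict(X_te))
# Similar values on both sets indicate the model generalizes well
print(f"train recall: {train_recall:.2f}, test recall: {test_recall:.2f}")
```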

Next Steps¶

  • The Tuned Decision Tree Model can be used by the organization in predicting positive Covid-19 cases in its patient population.
  • Top features or parameters that can influence the predictions are:
  1. Patient age
  2. Leukocytes
  3. Ward admission
  4. Hematocrit
  • Resources can therefore be allocated to ensure that these parameters are measured in suspected Covid-19 cases.
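A minimal sketch of how such top features can be read off a fitted decision tree, using synthetic data whose column names merely mirror the factors listed above (they are not the project's actual columns):

```python
# Sketch: ranking features by a fitted decision tree's importances.
# Data is synthetic; the target is driven by the first two columns.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
cols = ["patient_age_quantile", "leukocytes", "ward_admission", "hematocrit"]
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
y = ((X["patient_age_quantile"] + X["leukocytes"]) > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
importances = (
    pd.Series(tree.feature_importances_, index=cols)
    .sort_values(ascending=False)
)
print(importances)  # importances always sum to 1
```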

Problems and Solution Summary¶

Problems:¶

  1. Hospitals such as the Hospital Israelita Albert Einstein of São Paulo, Brazil are overwhelmed and under-resourced in identifying positive Covid-19 cases among patients with flu-like symptoms.

  2. Patients present with a multitude of signs and symptoms that make it difficult to isolate Covid-19 cases.

  3. Patient data is numerous, varied, and contains errors and missing values.

  4. Choosing the best predictive model out of the many machine learning models is challenging.

Solution:¶

  1. Develop a predictive analytical model that can serve as a sensitive screening tool.

  2. Identify the most important parameters that influence positive predictions.

  3. Apply data cleaning and other pre-processing techniques so that salient and useful information can be extracted from the dataset.

  4. Use performance metrics to evaluate the various analytical models and select the model with the best recall.

Is the model a valid solution to the problem?¶

  • ANSWER: YES
  • With a recall of 0.87, the model will correctly flag positive cases 87% of the time.
  • In other words, it will detect 87% of all positive Covid-19 cases.

Recommendations on implementation¶

  • The hospital management should redistribute resources to ensure that the key features or parameters that influence the prediction of positive Covid-19 cases are captured.

  • Patient age and type of admission must be recorded at all times.

  • In the absence of PCR machines and other direct modalities for covid testing, the necessary equipment and reagents needed to test for leukocytes, hematocrit, red blood cell parameters and RSV, Rhinovirus_Enterovirus infections must be supplied to the frontline areas of the hospital where patients are first encountered.

Cost-Benefit Analysis¶

Benefits:

  • Early identification and management of Covid-19 cases.
  • Prevention of cross-infection and spread of the disease among patients and staff
  • Protection of hospital’s workforce from infection and in doing so, maintaining productivity.
  • Adequate allocation of hospital’s resources in managing Covid-19 infections.

Cost:

  • More staff required for data capture, laboratory testing and implementation of models.
  • Regular education and capacity-building exercises on identification of predictive parameters and the use of the predictive model.
  • Building a standardized protocol for the use of the model in the entire hospital will be labor-intensive.

Risks and Challenges¶

  • Poor data capture necessary for model prediction.
  • Inadequate technical know-how by staff.
  • Insufficient workforce for data collection, testing and deployment of the model.

Conclusion¶

  • The hospital management is prudent in using predictive machine learning models in making business intelligence decisions that will save lives, increase productivity and enable the delivery of value-based healthcare.

  • The tuned decision tree model will be an effective screening tool for the identification of positive Covid-19 cases among patients with flu-like symptoms.

  • It will support hospitals in areas where it is impossible or impractical to test everyone for Covid-19 infection.